Training Classical and Quantum Algorithms for Orthogonal Neural Networks

ABSTRACT

Orthogonal neural networks impose orthogonality on the weight matrices. They may achieve higher accuracy and avoid evanescent or explosive gradients for deep architectures. Several classical gradient descent methods have been proposed to preserve orthogonality while updating the weight matrices, but these techniques suffer from long running times and provide only approximate orthogonality. In this disclosure, we introduce a new type of neural network layer. The layer allows for gradient descent with perfect orthogonality with the same asymptotic running time as a standard layer. The layer is inspired by quantum computing and can therefore be applied on a classical computing system as well as on a quantum computing system. It may be used as a building block for quantum neural networks and fast orthogonal neural networks.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to Indian Patent Application No. 202141023642, “Quantum Orthogonal Neural Networks,” filed on May 27, 2021, the subject matter of which is incorporated herein by reference in its entirety.

BACKGROUND

1. Technical Field

This disclosure relates generally to neural networks, and more particularly, to training and using orthogonal neural networks using a quantum computing system or a classical computing system.

2. Description of Related Art

In the evolution of neural network structures, adding constraints to the weight matrices has often been an effective path. For example, orthogonal neural networks (OrthoNNs) have been proposed as a new type of neural network for which, at each layer, the weights matrix should remain orthogonal. This property is useful for reaching higher accuracy and avoiding vanishing or exploding gradients in deep architectures. Several classical gradient descent methods have been proposed to preserve orthogonality while updating the weights matrices. However, these techniques suffer from longer running times and sometimes only approximate orthogonality. In particular, the main method for achieving orthogonality during training is to first perform gradient descent to update the weights matrix (which is then no longer orthogonal) and then perform a Singular Value Decomposition (SVD) to orthogonalize or almost orthogonalize the weights matrix. However, achieving orthogonality in this way hinders fast training, since an SVD computation needs to be performed at every step.

In the emergent field of quantum machine learning, several proposals have been made to implement neural networks. Some algorithms rely on long-term, perfect quantum computers, while others try to harness existing quantum devices using variational circuits. However, it is unclear how such architectures scale and whether they provide efficient and accurate training.

SUMMARY

This disclosure describes novel approaches for machine learning algorithms, such as deep learning algorithms. This disclosure describes a class of neural networks that have orthogonal weight matrices. This is an improved technique for approximating certain functions, like those for the classification of data, for the reasons described below. The neural networks described herein may also be optimized in terms of the number of gates, the scaling of the training time, and the type of gates in the circuit.

Orthogonal neural networks may provide an advantage for deep neural networks, that is, neural networks with a large number of layers. They may preserve norms during both the forward and backward passes. This property helps prevent vanishing and exploding gradients, which are prominent in deep neural networks. Such neural networks also have the property of non-redundancy in the weights: since the weight vectors are orthogonal and linearly independent, each of them gives “different” information about the input-output relation.

Some embodiments relate to a quantum architecture for a connected neural network that offers orthogonality in the weight matrices. In some embodiments, the neural network comprises a quantum circuit shaped like an inverted pyramid for each layer of the neural network.

Some embodiments relate to using a unary preserving quantum circuit (e.g., with BS gates as described in Section 2) to form a layer of an orthogonal neural network. The layer may be trained in O(n) time where n is the number of input nodes of the layer. Data may be loaded into the layer using a data loader, e.g., as described in Section 2.

Other aspects include components, devices, systems, improvements, methods, processes, applications, computer readable mediums, and other technologies related to any of the above.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the disclosure have other advantages and features which will be more readily apparent from the following detailed description and the appended claims, when taken in conjunction with the examples in the accompanying drawings, in which:

FIG. 1 is a diagram that represents the quantum mapping for a BS quantum gate.

FIG. 2A is a diagram of a quantum circuit for an 8×8 orthogonal layer of a neural network.

FIG. 2B is a diagram of a classical 8×8 orthogonal layer of a neural network.

FIG. 3A is a diagram of a quantum circuit for a rectangular 8×4 orthogonal layer of a neural network.

FIG. 3B is a diagram of a classical 8×4 orthogonal layer of a neural network.

FIG. 4A is a diagram of a quantum circuit that includes a linear cascade data loader circuit.

FIGS. 4B-D are diagrams of other data loader circuits.

FIG. 5 is a diagram that illustrates paths from a 7^(th) unary state to a 6^(th) unary state on an 8×8 quantum pyramidal circuit.

FIG. 6 is a diagram of a three-qubit pyramidal circuit and an orthogonal matrix associated with the circuit.

FIGS. 7A-7C are diagrams of a pyramidal circuit applied on a loaded vector x with two non-zero values.

FIG. 8A is a diagram of a classical pyramidal circuit.

FIG. 8B is a diagram of a 4×4 layer of a neural network.

FIG. 9 is a diagram of a pyramid circuit that includes notation used for training.

FIG. 10 is a plot of numerical experiments.

FIG. 11 is a diagram of a quantum circuit configured to implement a BS gate.

FIGS. 12A-12C are diagrams of quantum circuits used to obtain the signs of a vector's components.

FIG. 13 is a diagram of another quantum circuit used to obtain the signs of a vector's components.

FIG. 14A is a diagram of a neural network with three layers.

FIG. 14B is a diagram of a quantum circuit for the neural network in FIG. 14A.

FIG. 15 is a diagram that illustrates a matrix to angles conversion traversal for a 6×6 layer.

FIG. 16 is a flowchart of a method for executing a quantum circuit to implement a layer of a neural network.

FIG. 17 is a flowchart of a method for training a layer of a neural network with an orthogonal weight matrix.

FIGS. 18A-18E are diagrams of example quantum circuits.

FIG. 19 is a diagram of another quantum circuit.

FIG. 20A is a block diagram that illustrates a computing system.

FIG. 20B is a block diagram that illustrates a quantum computing system.

FIG. 20C is a block diagram of a qubit register.

FIG. 20D is a flow chart that illustrates an example execution of a quantum routine on a computing system.

FIG. 21 is an example architecture of a classical computing system.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The figures and the following description relate to preferred embodiments by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily recognized as viable alternatives that may be employed without departing from the principles of what is claimed.

1. Introduction

This disclosure presents a new training method for neural networks that preserves (e.g., perfect) orthogonality while having the same running time as usual gradient descent methods without the orthogonality condition, thus achieving the best of both worlds: the most efficient training and perfect orthogonality.

One of the main ideas comes from the quantum world, where any quantum circuit corresponds to an operation described by a unitary matrix, which, if we only use gates with real amplitudes, is an orthogonal matrix. In particular, this disclosure describes a novel special-architecture quantum circuit, for which there is an efficient way to map the elements of an orthogonal weights matrix to the parameters of the gates of the quantum circuit and vice versa. Thus, while performing a gradient descent on the elements of the weights matrix individually does not preserve orthogonality, performing a gradient descent on the parameters of the quantum circuit does preserve orthogonality (since any quantum circuit with real parameters corresponds to an orthogonal matrix) and is equivalent to updating the weights matrix. This disclosure also proves that performing gradient descent on the parameters of the quantum circuit can be done efficiently classically (with constant update cost per parameter), thus concluding that there exists a quantum-inspired, but fully classical, way of efficiently training perfectly orthogonal neural networks.

Moreover, the special-architecture quantum circuit defined herein has many properties that make it a good candidate for NISQ (Noisy Intermediate-Scale Quantum) implementations: it may use only one type of quantum gate, may use a simple connectivity between the qubits, may have depth linear in the input and output node sizes, and may benefit from powerful error mitigation techniques that make it resilient to noise. This allows us to also propose an inference method running the quantum circuit on data, which might offer a faster running time (e.g., given the shallow depth of the quantum circuit).

Some of our contributions are summarized in Table 1 (below), where we have considered the time to perform a feedforward pass or one gradient descent step. A single neural network layer is considered, with input and output of size n. For example, the methods described in this disclosure are just as fast as other methods during the feedforward pass. Additionally, the algorithms in this disclosure are faster than other orthogonal methods and just as fast as non-orthogonal methods in the matrix update process.

TABLE 1. Running times summary. n is the size of the input and output vectors; δ is the error parameter in the quantum implementation.

Method                                         Feedforward Pass      Orthogonal matrix update
Quantum Pyramidal Circuit (This disclosure)    2n/δ² = O(n/δ²)       O(n²/δ²)
Classical Pyramidal Circuit (This disclosure)  2n(n − 1) = O(n²)     O(n²)
Classical Approximated OrthoNN (SVB)           n² = O(n²)            O(n³)
Classical Strict OrthoNN (Stiefel Manifold)    n² = O(n²)            O(n³)
Standard Neural Network (non orthogonal)       n² = O(n²)            O(n²)

2. A Parametrized Quantum Circuit for Orthogonal Neural Networks

In this section we define a special-architecture parametrized quantum circuit that may be useful for performing training and inference on orthogonal neural networks. As noted above, the training may be (e.g., completely) classical in the end, but the intuition of the new method comes from this quantum circuit, while the inference can happen classically or by applying this quantum circuit. A basic introduction to quantum computing concepts for this work is given in Sections 9 and 12.

2.1 The BS Gate

The quantum circuits proposed in this work that implement fully connected neural network layers with orthogonal weight matrices may use only one type of quantum gate: the Reconfigurable Beam Splitter (BS) gate. The BS gate is a parametrizable two-qubit gate. This two-qubit gate may be considered hardware efficient, and it may have one parameter: angle θ ∈ [0, 2π]. An example matrix representation of the BS gate is given as:

$\begin{matrix}{BS(\theta) = \begin{pmatrix}1 & 0 & 0 & 0 \\ 0 & \cos\theta & \sin\theta & 0 \\ 0 & -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1\end{pmatrix}, \qquad BS(\theta):\left\{ \begin{matrix} |01\rangle \mapsto \cos\theta\,|01\rangle - \sin\theta\,|10\rangle \\ |10\rangle \mapsto \sin\theta\,|01\rangle + \cos\theta\,|10\rangle \end{matrix} \right.} & (1)\end{matrix}$

The BS gate may be represented by other similar matrices. For example, the rows and columns of the above matrix can be permuted, a phase element e^(ip) may be introduced instead of the “1” at matrix position (4,4), or the two elements sin(θ) and −sin(θ) may be changed to, for example, i*sin(θ) and i*sin(θ). The above BS gate can also be decomposed into a set of two- and one-qubit parametrized gates. All these gates are practically equivalent, and our methods can use any of them. Thus, as used herein, “BS gate” may refer to any of these gates. Here are some specific examples of alternative BS gates; this list is not exhaustive:

BS₁(θ)=[[1,0,0,0],[0,cos(θ),−i*sin(θ),0],[0,−i*sin(θ),cos(θ),0],[0,0,0,1]];

BS₂(θ)=[[1,0,0,0],[0,cos(θ),sin(θ),0],[0,sin(θ),−cos(θ),0],[0,0,0,1]];

BS₃(θ,φ)=[[1,0,0,0],[0,cos(θ),−i*sin(θ),0],[0,−i*sin(θ),cos(θ),0],[0,0,0,e^(−iφ)]]; and

BS₄(θ,φ)=[[1,0,0,0],[0,cos(θ),sin(θ),0],[0,−sin(θ),cos(θ),0],[0,0,0,e^(−iφ)]].

We can think of the BS gate as a rotation in the two-dimensional subspace spanned by the basis {|01⟩, |10⟩}, while it acts as the identity in the remaining subspace {|00⟩, |11⟩}. Or equivalently, starting with two qubits, one in the |0⟩ state and the other one in the state |1⟩, the qubits can be swapped or not in superposition. The qubit |1⟩ stays on its wire with amplitude cos θ or switches with the other qubit with amplitude +sin θ if the new wire is below (|10⟩ → |01⟩) or −sin θ if the new wire is above (|01⟩ → |10⟩). Note that in the two other cases (|00⟩ and |11⟩) the BS gate acts as identity. FIG. 1 is a diagram that represents the quantum mapping of a BS gate on two qubits.
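As an illustration of the mapping in Eq.(1), the following minimal NumPy sketch builds the example BS matrix and applies it to the |01⟩ and |10⟩ basis states. The function name `bs_gate` and the basis ordering (|00⟩, |01⟩, |10⟩, |11⟩) are assumptions made for this sketch only.

```python
import numpy as np

def bs_gate(theta):
    """Example 4x4 BS matrix of Eq. (1), basis order |00>, |01>, |10>, |11>."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([
        [1.0, 0.0, 0.0, 0.0],
        [0.0,   c,   s, 0.0],
        [0.0,  -s,   c, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ])

theta = 0.4  # arbitrary angle chosen for the demonstration
BS = bs_gate(theta)

ket01 = np.array([0.0, 1.0, 0.0, 0.0])
ket10 = np.array([0.0, 0.0, 1.0, 0.0])

out01 = BS @ ket01  # cos(theta)|01> - sin(theta)|10>
out10 = BS @ ket10  # sin(theta)|01> + cos(theta)|10>

# With only real entries, the gate is an orthogonal matrix.
is_orthogonal = np.allclose(BS.T @ BS, np.eye(4))
```

The |00⟩ and |11⟩ components are left untouched, so the gate never changes the number of qubits in the |1⟩ state.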

2.2 Quantum Pyramidal Circuit

We now propose a quantum circuit that implements an orthogonal layer of a neural network. The circuit is a pyramidal structure of BS gates, each with an independent angle. More details are provided below concerning the input loading and the equivalence with a neural network's orthogonal layer.

To mimic a given classical layer with a quantum circuit, the number of output qubits may be the size of the classical layer's output. We refer to the square case when the input and output sizes are equal, and to the rectangular case otherwise.

One property to note is that the number of parameters of the quantum pyramidal circuit corresponding to a neural network layer of size n×d is (2n−1−d)*d/2, which is the same as the number of degrees of freedom of an orthogonal matrix of dimension n×d (the least number of parameters that uniquely define the orthogonal matrix).

FIGS. 2A and 2B are example diagrams. FIG. 2A illustrates a quantum circuit for an 8×8 fully connected, orthogonal layer of a neural network. Each vertical line corresponds to a BS gate with its angle parameter θ_(i). Note that FIG. 2A includes qubit labels. For simplicity, other circuit diagrams herein do not include qubit labels. FIG. 2B illustrates the equivalent classical orthogonal neural network 8×8 layer.

For simplicity, our analysis considers the square case (i.e., n input nodes and n output nodes) but everything can be easily extended to the rectangular case (i.e., n input nodes and p≠n output nodes). As stated, the pyramidal structure of the quantum circuit described above imposes the number of free parameters to be N=n(n−1)/2, which is the exact number of free parameters to specify an n×n orthogonal matrix. Said differently, there is an efficient one-to-one mapping between the N=n(n−1)/2 parameter angles {θ_(i): i ∈ [N]} of the gates in the inverted pyramid and the N=n(n−1)/2 degrees of freedom of an n×n orthogonal matrix W with entries w_(ij). In the example case of FIG. 2A with n=8, we have N=28, which is the number of gates and the number of free elements of an 8×8 orthogonal matrix.

In Section 3 we show how the parameters of the gates of this pyramidal circuit can be related to the elements of the orthogonal matrix of size n×n that describes it. We note that alternative architectures can be imagined as long as the number of gate parameters is equal to the number of parameters of the orthogonal weights matrix and a (e.g., simple) mapping between them and the elements of the weights matrix can be found.

Note that this pyramid circuit has linear depth and is convenient for near-term quantum hardware platforms with restricted connectivity. Indeed, the example distribution of the BS gates uses only nearest neighbor connectivity between qubits in the circuit diagram. However, alternative versions may or may not use nearest neighbor connectivity (examples later).

FIG. 3A is a diagram of a quantum circuit for a rectangular 8×4 fully connected orthogonal layer, and FIG. 3B is a diagram of the equivalent 8×4 classical orthogonal neural network. They both have 22 free parameters.
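The parameter counts quoted for the 8×8 and 8×4 layers can be checked with a few lines of Python. This is only an arithmetic sketch of the formula (2n−1−d)*d/2; the function name `pyramid_param_count` is hypothetical.

```python
def pyramid_param_count(n, d):
    """Number of BS-gate angles for an n x d layer: (2n - 1 - d) * d / 2."""
    # The product is always even: if d is odd, then (2n - 1 - d) is even.
    return (2 * n - 1 - d) * d // 2

square_8x8 = pyramid_param_count(8, 8)  # 28, equal to n(n - 1)/2 for n = 8
rect_8x4 = pyramid_param_count(8, 4)    # 22
```

In the square case d=n the formula reduces to n(n−1)/2, the number of degrees of freedom of an n×n orthogonal matrix.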

Although FIGS. 2A and 3A illustrate gates arranged in an inverted pyramid shape, other shapes are possible to implement a layer of a neural network. For example, BS gates may be arranged according to a right side up pyramid. In another example, the BS gates may be arranged according to a triangle shape. Furthermore, the inverted pyramid shapes illustrated in FIGS. 2A and 3A are a result of the BS gates being applied to adjacent qubits in the circuit diagram. However, in other embodiments, one or more BS gates may be applied to qubits that are not adjacent in the circuit diagram. In these embodiments, the BS gates may form other shapes (e.g., non-pyramid shapes). Descriptions of pyramidal circuits in this disclosure may be applicable to these other circuit arrangements. Additionally, Section 11 describes different types of circuits that can implement a layer of a neural network. Descriptions of pyramidal circuits in this disclosure may also be applicable to these circuits.

2.3 Loading the Data

Before applying the quantum pyramidal circuit, the classical data may be uploaded into the quantum circuit. We may use one qubit per feature of the data. For this, we use a unary amplitude encoding of the input data. Let's consider an input sample x=(x₀, . . . , x_(n−1)) ∈ ℝ^(n), such that ∥x∥₂=1. The sample can be encoded in a superposition of unary states:

|x⟩ = x₀|10 . . . 0⟩ + x₁|010 . . . 0⟩ + . . . + x_(n−1)|0 . . . 01⟩  (2)

The previous state can be rewritten using |e_(i)⟩ to represent the i^(th) unary state with a |1⟩ in the i^(th) position |0 . . . 010 . . . 0⟩, as:

$\begin{matrix}{|x\rangle = \sum\limits_{i = 0}^{n - 1} x_{i}\,|e_{i}\rangle} & (3)\end{matrix}$

Although a logarithmic depth data loader circuit can be used for loading such states, a simpler circuit may be used. It is a linear depth cascade of n−1 BS gates which, due to the structure of our quantum pyramidal circuit, may only add 2 extra steps to our pyramid circuit. An example of this linear depth cascade circuit (also referred to as the “diagonal loader”) is illustrated in FIG. 4A.

FIG. 4A is a diagram that includes an 8-dimensional linear data loader circuit 405. The example circuit 405 includes a linear cascade of BS gates with parameters α_(i). The loader circuit 405 is efficiently embedded before the pyramidal circuit 410. The input state in FIG. 4A is the first unary state (|10 . . . 0⟩). The angle parameters α₀, . . . , α_(n−2) may be classically pre-computed from the input vector. Note that the data loader circuit 405 includes an X gate (not illustrated) that flips the first qubit from the |0⟩ state to the |1⟩ state.

Generally, a data loader circuit starts in the all |0⟩ state and flips a first qubit using an X gate, in order to obtain the unary state |10 . . . 0⟩ (e.g., as shown in FIG. 4A). Then a cascade of BS gates creates the state |x⟩ using a set of n−1 angles α₀, . . . , α_(n−2). Using Eq.(1), the angles are chosen such that, after the first BS gate of the loader, the qubits would be in the state x₀|100 . . .⟩ + sin(α₀)|010 . . .⟩, and after the second one in the state x₀|100 . . .⟩ + x₁|010 . . .⟩ + sin(α₀) sin(α₁)|001 . . .⟩, and so on, until obtaining |x⟩ as in Eq.(2). To this end, classical preprocessing may be performed to compute recursively the n−1 loading angles, in time O(n):

$\begin{matrix}\left\{ \begin{matrix}{\alpha_{0} = {\arccos\left( x_{0} \right)}} \\{\alpha_{1} = {\arccos\left( \frac{x_{1}}{\sin\left( \alpha_{0} \right)} \right)}} \\{\alpha_{2} = {\arccos\left( \frac{x_{2}}{{\sin\left( \alpha_{0} \right)}{\sin\left( \alpha_{1} \right)}} \right)}} \\\ldots\end{matrix} \right. & (4)\end{matrix}$
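The recursion of Eq.(4) can be sketched in NumPy as follows, together with a classical simulation of the loader cascade on the unary amplitudes. The function names are hypothetical, and two details are assumptions of this sketch rather than part of the description above: the last angle is negated when the last component of x is negative (since arccos alone yields a non-negative final amplitude), and the recursion stops early when no amplitude remains to distribute.

```python
import numpy as np

def loader_angles(x):
    """Compute the n-1 loader angles of Eq. (4) for a unit-norm vector x, in O(n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    alphas = np.zeros(n - 1)
    remaining = 1.0  # running product sin(alpha_0) * ... * sin(alpha_{k-1})
    for k in range(n - 1):
        if abs(remaining) < 1e-12:
            break  # assumption: nothing left to distribute, keep angles at 0
        ratio = np.clip(x[k] / remaining, -1.0, 1.0)
        alphas[k] = np.arccos(ratio)
        if k == n - 2 and x[n - 1] < 0:
            alphas[k] = -alphas[k]  # assumption: sign fix for the last component
        remaining *= np.sin(alphas[k])
    return alphas

def run_loader(alphas):
    """Simulate the cascade on unary amplitudes, starting from |10...0>."""
    n = len(alphas) + 1
    amp = np.zeros(n)
    amp[0] = 1.0
    for k, alpha in enumerate(alphas):
        c, s = np.cos(alpha), np.sin(alpha)
        a, b = amp[k], amp[k + 1]
        # BS gate of Eq. (1) acting on wires k (upper) and k+1 (lower)
        amp[k], amp[k + 1] = c * a - s * b, s * a + c * b
    return amp

x = np.array([0.5, -0.5, 0.5, -0.5])  # example input with unit norm
loaded = run_loader(loader_angles(x))
```

Under these assumptions, `loaded` reproduces the amplitudes of x on the unary basis states.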

The ability to load data in such a way relies on the assumption that each input vector is normalized, i.e., ∥x∥₂=1. This normalization constraint could seem arbitrary and could impact the ability to learn from the data. In fact, in the case of orthogonal neural networks, this normalization shouldn't degrade the training because orthogonal weight matrices are in fact orthonormal and thus norm-preserving. Hence, changing the norm of the input vector, by dividing each component by ∥x∥₂, in both the classical and quantum settings is not a problem. The normalization would impose that each input has the same norm, or the same “luminosity” in the context of images, which can be helpful or harmful depending on the use case.

2.4 Additional Information on Data Loader Circuits

The first step of the data loading, given access to a classical data point (e.g., x=(x₁, x₂, . . . , x_(d))), is to pre-process the classical data efficiently, e.g., spending only O(d) total time (where the logarithmic factors are hidden), in order to create a set of parameters (e.g., θ=(θ₁, θ₂, . . . , θ_(d−1))) that will be the parameters of the (d−1) two-qubit gates used in our quantum data loader circuit. During pre-processing, we may also keep track of the norms of the vectors. Note that these angle parameters are different depending on which data loader circuit is used.

We may use three different types of data loader circuits. FIGS. 4B-D illustrate these different data loader circuits for eight qubits. Specifically, FIG. 4B is a diagram of a “parallel loader circuit,” FIG. 4C is a diagram of another diagonal loader circuit, and FIG. 4D is a diagram of a “semi-diagonal loader circuit.” The X in each figure corresponds to the single-qubit Pauli X gate, while vertical lines represent the two-qubit BS gates.

The shallowest data loader circuit is the parallel data loader circuit (example in FIG. 4B), which loads d-dimensional data points using d qubits and d−1 BS gates. The parallel loader circuit has a depth of log(d)+1. While this data loader may have the smallest depth of the three different types of data loaders, it may also have the highest qubit connectivity. In other words, the circuit diagrams of parallel data loaders may have the greatest number of BS gates that are applied to non-nearest neighbor qubits.

An example method for constructing a parallel data loader circuit is the following. We start with all qubits initialized to the 0 state. In the first step, we apply an X gate on the first qubit. Then, the circuit is constructed by adding BS gates in layers, using the angles θ we constructed before. The first layer has 1 BS gate, the second layer has 2 BS gates, the third layer has 4 BS gates, and so on until the log(d)-th layer, which has d/2 gates. The qubits to which the gates are added follow a tree structure (e.g., a binary tree structure). In the first layer we have one BS gate between qubits (0,d/2) with angle θ₁; in the second layer we have two BS gates between (0,d/4) with angle θ₂ and (d/2,3d/4) with angle θ₃; in the third layer there are four BS gates between qubits (0,d/8) with angle θ₄, (d/4,3d/8) with angle θ₅, (d/2,5d/8) with angle θ₆, and (3d/4,7d/8) with angle θ₇; and so forth for the other layers. Parallel data loader circuits are also described in U.S. patent application Ser. No. 16/986,553 filed on Aug. 6, 2020, which is incorporated herein by reference.
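The layer-by-layer gate placement described above can be enumerated with a short Python sketch. The function name `parallel_loader_layout` and the output format are hypothetical, and the sketch assumes that d is a power of two; it only lists which qubit pairs receive a BS gate and which angle index they use, not how the angles are computed.

```python
def parallel_loader_layout(d):
    """Return BS-gate placements per layer as (qubit_a, qubit_b, angle_index)."""
    if d < 2 or d & (d - 1):
        raise ValueError("this sketch assumes d is a power of two, d >= 2")
    layers = []
    angle_index = 1  # angles are numbered theta_1 ... theta_{d-1}
    span = d         # size of the qubit block split by each gate in this layer
    while span > 1:
        layer = []
        for start in range(0, d, span):
            layer.append((start, start + span // 2, angle_index))
            angle_index += 1
        layers.append(layer)
        span //= 2
    return layers

layout_8 = parallel_loader_layout(8)
# layer 1: (0, 4) with theta_1
# layer 2: (0, 2) with theta_2, (4, 6) with theta_3
# layer 3: (0, 1), (2, 3), (4, 5), (6, 7) with theta_4 ... theta_7
```

For d=8 this gives log(d)=3 layers of BS gates and d−1=7 gates in total; together with the initial X gate the depth is log(d)+1.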

The two other types of data loader circuits may have worse asymptotic depth (in other words, larger depths) but fewer BS gates that are applied to non-nearest neighbor qubits.

The diagonal data loader uses d qubits and d−1 BS gates that may be applied to nearest neighboring qubits in the circuit diagram (e.g., see FIG. 4C). However, diagonal data loaders may have a circuit depth of d−1.

The semi-diagonal loader similarly uses d qubits and d−1 BS gates that may be applied to nearest neighboring qubits in the circuit diagram (e.g., see FIG. 4D). However, the semi-diagonal loader may have a depth of d/2. As illustrated in FIG. 4D, the semi-diagonal loader is similar to the diagonal loader except it starts from the middle qubit (instead of the top or bottom qubit).

To determine which data loader circuit to use, we typically choose the data loader that increases the depth the least. With the pyramid circuit, for instance, the diagonal data loader circuit fits well, despite its large intrinsic depth (as described above). However, for other neural network layer circuits, this may not be the case. When there is no such trick, the parallel loader is typically preferred because of its small depth.

3. OrthoNNs Feedforward Pass

3.1 Brief Description of Feedforward Pass

Given the angles, one can find the unique matrix, and given the matrix, one can uniquely specify the angles. To get a w_(ij) entry of the weight matrix for a layer, we take the sum of expressions from (e.g., all) possible paths from qubit j to qubit i using the following rules:

(A) If we pass by any gate with angle θ_(n), we multiply with cos(θ_(n)).

(B) If we go up on any gate with angle θ_(n), we multiply with −sin(θ_(n)).

(C) If we go down on any gate with angle θ_(n), we multiply with sin(θ_(n)).

Calculating the weight matrix in this or a similar manner can be done efficiently using various techniques like recursion, dynamic programming, or applying the gates to the weight matrix in the appropriate order, since this is similar to the implementation of the BS gates described above.
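One of the options mentioned above, applying the gates to the weight matrix in order, can be sketched in NumPy as follows. The gate ordering produced by `pyramid_gates` is an assumption of this sketch (one nearest-neighbor layout with n(n−1)/2 gates over 2n−3 timesteps); the numbering of gates in the figures may differ. All function names are hypothetical.

```python
import numpy as np

def pyramid_gates(n):
    """Upper-wire index of each BS gate, timestep by timestep (assumed layout)."""
    wires = []
    for t in range(2 * n - 3):
        top = min(t, 2 * n - 4 - t)
        wires.extend(range(t % 2, top + 1, 2))
    return wires

def weight_matrix(n, thetas):
    """Build W by applying each gate as a planar rotation on rows i and i+1."""
    W = np.eye(n)
    for i, theta in zip(pyramid_gates(n), thetas):
        c, s = np.cos(theta), np.sin(theta)
        upper, lower = W[i].copy(), W[i + 1].copy()
        W[i] = c * upper - s * lower      # staying: cos, going up: -sin
        W[i + 1] = s * upper + c * lower  # going down: +sin, staying: cos
    return W

rng = np.random.default_rng(0)
n = 4
thetas = rng.uniform(0.0, 2.0 * np.pi, size=n * (n - 1) // 2)
W = weight_matrix(n, thetas)
orthogonality_error = np.max(np.abs(W @ W.T - np.eye(n)))
```

Each entry W[i, j] obtained this way equals the sum over paths from wire j to wire i under rules (A) to (C), and W stays orthogonal for any choice of angles because it is a product of rotations.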

To obtain the angles from a given orthogonal matrix, we traverse the orthogonal matrix column by column from right to left, going from bottom to top (until before the anti-diagonal element) in each column. For example, see FIG. 15B. Since we know the expression in terms of sines and cosines of a subset of angles for each matrix element, we can equate it with the corresponding actual value in the orthogonal matrix. Traversing it in this manner leads to equations with only one unknown angle, which can therefore be retrieved.

We can combine such layers sequentially to create a larger quantum neural network. Between each layer, one can measure the quantum states, apply a non-linearity, and then upload the data to the next layer. For example, see FIGS. 14A-14B (further described below).

We can also add another quantum layer before or after the pyramidal structure to construct different architectures that encompass our construction.

To load the data for each layer, we can use the construction in FIG. 5 or any similar data loading procedure. The angles are computed from the input vector in linear time in the dimension of the vector.

3.2 Detailed Description of Feedforward Pass

The following paragraphs further describe subject matter in Section 3.1 above.

In this section we detail the effect of the quantum pyramidal circuit on an input encoded in a unary basis, as in Eq.(2). We will also see in the end how to simulate this quantum circuit classically with a small overhead and thus be able to provide a fully classical scheme.

Let's first consider one pure unary input, where only the qubit j is in state |1⟩ (e.g., |00000010⟩). This unary input is transformed into a superposition of unary states, each with an amplitude. If we consider again only one of these possible unary outputs, where only the qubit i is in state |1⟩, its amplitude can be interpreted as a conditional amplitude to transfer the |1⟩ from qubit j to qubit i. Intuitively, this value is the sum of the quantum amplitudes associated with each possible path that connects the qubit j to qubit i, as shown in FIG. 5.

Using this image of connectivity between input and output qubits, we can construct a matrix W ∈ ℝ^(n×n), where each element W_(ij) is the overall conditional amplitude to transfer the |1⟩ from qubit j to qubit i.

FIG. 5 is a diagram that illustrates the three possible paths from the 7^(th) unary state to the 6^(th) unary state on a quantum pyramidal circuit to implement an 8×8 neural network layer. FIG. 5 shows an example where exactly three paths can be taken to map the input qubit j=6 (the 7^(th) unary state) to the qubit i=5 (the 6^(th) unary state). Each path comes with a certain amplitude. For instance, path 505 moves up at the first gate, and then stays put in the next three gates, with a resulting amplitude of −sin(θ₁₆) cos(θ₁₇) cos(θ₂₃) cos(θ₂₄). The sum of the amplitudes of all possible paths gives us the element W₅₆ of the matrix W (where, for simplicity, s(θ) and c(θ) respectively stand for sin(θ) and cos(θ)):

W ₅₆=−s(θ₁₆)c(θ₂₂)s(θ₂₃)−s(θ₁₆)c(θ₁₇)c(θ₂₃)c(θ₂₄)+s(θ₁₆)s(θ₁₇)c(θ₁₈)s(θ₂₄)  (5)

In fact, the n×n matrix W can be seen as the unitary matrix of our quantum circuit if we solely consider the unary basis, which is specified by the parameters of the quantum gates. A unitary is in general a complex matrix, but in our case, with only real operations, the matrix is orthogonal. This proves the correspondence between any matrix W and the pyramidal quantum circuit.

The full unitary U_(W) in the Hilbert space of our n-qubit quantum circuit is a 2^(n)×2^(n) matrix with the n×n matrix W embedded in it as a submatrix on the unary basis. This is achieved by loading the data as unary states and by using only BS gates that keep the number of 0s and 1s constant.
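The embedding of W inside the full 2^(n)×2^(n) unitary can be illustrated numerically for a small case. The sketch below builds the full matrix of a 3-qubit circuit from Kronecker products and reads off the block on the unary basis. The helper names, the gate order (0,1), (1,2), (0,1), the angle values, and the convention that qubit 0 is the leftmost (most significant) bit are all assumptions of this sketch.

```python
import numpy as np

def bs_gate(theta):
    """Example BS matrix of Eq. (1), basis order |00>, |01>, |10>, |11>."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0, 0], [0, c, s, 0], [0, -s, c, 0], [0, 0, 0, 1]], dtype=float)

def bs_on_register(n, i, theta):
    """BS gate on qubits (i, i+1) of an n-qubit register, identity elsewhere."""
    left = np.eye(2 ** i)
    right = np.eye(2 ** (n - i - 2))
    return np.kron(np.kron(left, bs_gate(theta)), right)

def unary_index(n, k):
    """Index of |e_k> (a single 1 on qubit k) when qubit 0 is the leftmost bit."""
    return 2 ** (n - 1 - k)

n = 3
gates = [(0, 0.3), (1, -1.1), (0, 0.7)]  # (upper wire, angle), assumed order
U = np.eye(2 ** n)
for i, theta in gates:
    U = bs_on_register(n, i, theta) @ U

idx = [unary_index(n, k) for k in range(n)]
W = U[np.ix_(idx, idx)]  # n x n block of U on the unary basis

# No amplitude leaks from unary states to non-unary states.
others = [j for j in range(2 ** n) if j not in idx]
leakage = np.max(np.abs(U[np.ix_(others, idx)]))
```

Because each BS gate preserves the number of 1s, the unary block W is itself orthogonal, and U maps unary states only to unary states.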

For instance, as shown in FIG. 6, a 3-qubit pyramidal circuit is described as a unique 3×3 matrix that can be verified to be orthogonal. More specifically, FIG. 6 illustrates an example of a 3-qubit pyramidal circuit and the equivalent orthogonal matrix. c(θ) and s(θ) respectively stand for cos(θ) and sin(θ).

In FIG. 5, we considered the case of a single unary state for both the input and output. But with actual data, as seen in Section 2.3, input and output states are in fact a superposition of unary states. Thanks to the linearity of quantum mechanics in the absence of measurements, the previous descriptions remain valid and can be applied to a linear combination of unary states.

Let's consider an input vector x ∈ ℝ^(n) encoded as a quantum state |x⟩ = Σ_(i=0)^(n−1) x_(i)|e_(i)⟩, where |e_(i)⟩ represents the i^(th) unary state, as explained in Eq.(3). By definition of W, each unary state |e_(i)⟩ will undergo a proper evolution |e_(i)⟩ ↦ Σ_(j=0)^(n−1) W_(ij)|e_(j)⟩. This yields, by linearity, the following mapping:

$\begin{matrix}{|x\rangle \mapsto \sum\limits_{i,j} W_{ij}\, x_{i}\,|e_{j}\rangle} & (6)\end{matrix}$

As explained above, our quantum circuit is equivalently described by the sparse unitary U_(W) ∈ ℝ^(2^(n)×2^(n)) or, in the unary basis, by the matrix W ∈ ℝ^(n×n). This can be summarized with

U_(W)|x⟩ = |Wx⟩  (7)

We see from Eq.(6) and Eq.(7) that the output is in fact |y⟩, the unary encoding of the vector y=Wx, which is the output of a matrix multiplication between the n×n orthogonal matrix W and the input x ∈ ℝ^(n). As expected, each element of y is given by y_(k)=Σ_(i=0)^(n−1) W_(ik)x_(i). See FIGS. 7A-7C for a diagram representation of this mapping.

FIGS. 7A-7C are diagrams of a pyramidal circuit applied on a loaded vector x with two non-zero values. The output is the unary encoding of y=Wx where W is the corresponding orthogonal matrix associated with the circuit.

Therefore, for any given neural network's orthogonal layer, there may be a quantum pyramidal circuit that reproduces it. Conversely, any quantum pyramidal circuit may be implementing an orthogonal layer of some sort.

Additional details concerning multi-layer branching, the tomography at the end of each layer, and the way to apply the non-linearities are given in Section 10.

Thus, the quantum circuits proposed in this work can rightfully be called “quantum neural networks” even though this term has been applied to arbitrary variational circuits that present some conceptual similarities to neural networks. With our quantum pyramidal circuits, we control and understand the quantum mapping. It implements each layer and its non-linearities in a modular way. Our orthogonal quantum neural networks are also different regarding the training strategies (see Section 4 for details).

3.1 Classical Implementation

While the quantum pyramidal circuit is presented as the inspiration of the new methods for orthogonal neural networks, these quantum circuits can be simulated classically on a classical computing system with a small overhead, thus yielding classical methods for orthogonal neural networks.

The classical algorithm may be the simulation of the quantum pyramidal circuit, where each BS gate is replaced by a planar rotation between its two inputs.

As shown in FIGS. 8A and 8B, we propose a similar classical pyramidal circuit, where each layer is constituted of $\frac{n(n-1)}{2}$ planar rotations, for a total of $4 \times \frac{n(n-1)}{2} = O(n^{2})$ basic operations. Therefore, our single layer feedforward pass has the same complexity O(n²) as the usual matrix multiplication.

FIGS. 8A and 8B illustrate a classical representation of a single orthogonal layer on a 4×4 case (n=4) performing x ↦ y=Wx. The angles and the weights can be chosen such that our classical pyramidal circuit (FIG. 8A) and normal classical network (FIG. 8B) are equivalent. Each connecting line represents a scalar multiplication with the value indicated. On the classical pyramidal circuit (FIG. 8A), inner layers ζ^(λ) are displayed. A timestep corresponds to the lines in between two inner layers (see Section 4 for definitions).
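The classical pyramidal layer described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names `pyramid_schedule` and `apply_layer` are hypothetical, the gate layout (the rotation on the pair (i, i+1) is active at timesteps i, i+2, ..., 2n−4−i) is an assumption chosen to give n(n−1)/2 rotations over 2n−3 timesteps, and the 2×2 rotation block [[cos θ, sin θ], [−sin θ, cos θ]] is inferred from the gradient expression of Eq. (9).

```python
import numpy as np

def pyramid_schedule(n):
    """List of timesteps; each timestep holds the first index i of every
    rotation acting on the pair (i, i+1). Assumed pyramid layout."""
    return [[i for i in range(n - 1)
             if i <= lam <= 2 * n - 4 - i and (lam - i) % 2 == 0]
            for lam in range(2 * n - 3)]

def apply_layer(x, thetas, schedule):
    """Apply every planar rotation in order, using one angle per rotation."""
    z = np.array(x, dtype=float)
    k = 0
    for step in schedule:
        for i in step:
            c, s = np.cos(thetas[k]), np.sin(thetas[k])
            zi, zj = z[i], z[i + 1]
            # 4 multiplications per rotation, hence 4 * n(n-1)/2 in total
            z[i], z[i + 1] = c * zi + s * zj, -s * zi + c * zj
            k += 1
    return z

n = 4
schedule = pyramid_schedule(n)
num_gates = sum(len(step) for step in schedule)

rng = np.random.default_rng(0)
thetas = rng.uniform(-np.pi, np.pi, size=num_gates)

# Column j of the equivalent weight matrix W is the image of basis vector e_j.
W = np.column_stack([apply_layer(e, thetas, schedule) for e in np.eye(n)])
x = rng.normal(size=n)
```

Under these assumptions, `W` is orthogonal for any choice of angles, and `W @ x` agrees with `apply_layer(x, thetas, schedule)`, which is the equivalence between FIG. 8A and FIG. 8B that the text describes.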

One may still have an advantage performing the quantum circuit for inference, since the quantum circuit has depth O(n), instead of the O(n²) classical complexity of the matrix-vector multiplication. Nevertheless, as discussed below, an advantage of our methods is that orthogonal weight matrices may be trained classically in time O(n²), instead of the previously best-known O(n³).

Described differently, inference on any input data can be done by sequentially applying each layer of the neural network. This is equivalent to multiplying the input by the generated orthogonal weight matrix. For an (n×n) layer, classically this takes time O(n²), the time to multiply an n×n matrix with an n-dimensional input vector, while the quantum circuit can perform this multiplication in O(n) steps, since the depth of the quantum circuit is O(n).

4. OrthoNN Training: Angle's Gradient Estimation and Orthogonal Matrix Update

4.1 Brief Description of OrthoNN Training

For clarity, the remaining paragraphs of this section rephrase the above description.

Unlike in classical feed-forward neural networks, gradient descent is performed on the BS gate angles directly and not on the weight matrix elements. This can be performed in multiple ways, such as batch gradient descent, stochastic gradient descent, etc., with a suitable learning rate. Mathematically, the update rule may be

$\left. \theta_{i}\leftarrow{\theta_{i} - {\eta\frac{\partial C}{\partial\theta_{i}}}} \right.$

and can use different kinds of optimizers, like Adam, RMSprop, Yogi, etc.

To calculate the gradient of the cost function C with respect to the angles of the BS gates, the error may be backpropagated not just over the layers of the network, but also over the mini layers (also referred to as “timesteps”). We denote the corresponding errors by Δ^(ℓ) and δ^(λ), respectively. Δ^(ℓ) is the vector representing the error (gradient) with respect to the pre-activation vector z^(ℓ) of the layer, that is

$\Delta^{\ell} = {\frac{\partial C}{\partial z^{\ell}}.}$

δ^(λ) is the vector representing the error (gradient) with respect to the input ζ^(λ) to the mini layer, that is

$\delta^{\lambda} = {\frac{\partial C}{\partial\zeta^{\lambda}}.}$

The values of δ^(λ) may be calculated in the following way: δ^(λ)=(ω^(λ))^(T)·δ^(λ+1), where ω^(λ) is the weight matrix of the mini layer with index λ. For the last timestep, the first to be calculated, we have δ^(λmax)=(ω^(λmax))^(T)·Δ^(ℓ).

For calculating the values of Δ^(ℓ), the following equation can be used: Δ^(ℓ−1)=δ⁰⊙σ′(z^(ℓ−1)), where σ′ is the derivative of the activation function σ and ⊙ is the entry-wise product.

For a BS gate with angle θ acting on the qubits i and i+1 in the mini layer λ, the gradient can be derived to be the following expression:

${\frac{\partial C}{\partial\theta} = {{\delta_{i}^{\lambda + 1}\left( {{{- \sin}(\theta)\zeta_{i}^{\lambda}} + {\cos(\theta)\zeta_{i + 1}^{\lambda}}} \right)} + {\delta_{i + 1}^{\lambda + 1}\left( {{{- {\cos(\theta)}}\zeta_{i}^{\lambda}} - {{\sin(\theta)}\zeta_{i + 1}^{\lambda}}} \right)}}},$

which can be calculated in constant time. See also Equation 9 and FIG. 9.

With a correct and efficient implementation of the above architecture and learning algorithm, the time taken for each layer to calculate and update the weights scales as O(nm) for a layer with n inputs and m outputs, for each data point. This is as good as classical non-orthogonal neural networks and provides the advantages offered by orthogonality. The forward pass (inference only) on a quantum computing system, once the model is trained, gives a quadratic speedup, as it scales as O(n) instead of O(nm) as in the classical case.

4.2 Detailed Description of OrthoNN Training

The following paragraphs further describe subject matter in Section 4.1 above.

An introduction to backpropagation in fully connected neural networks, and the associated notation, is given in Section 8.

When using quantum circuits to implement layers of a neural network, the parameters to update are no longer the individual elements of the weight matrices directly but may be the angles of the BS gates that give rise to these matrices. Thus, we design an adaptation of the backpropagation method to our setting based on the angles.

We start by introducing some notation for a single layer ℓ of the neural network, which is not made explicit in the notation for simplicity. We assume we have as many output qubits as input qubits, but this can easily be extended to the rectangular case.

We first introduce the notion of timesteps inside each neural network layer, which correspond to the computational steps in the pyramidal structure of the circuit (see FIG. 9). With the pyramid circuit, for n inputs, there are 2n−3 such timesteps, each one indexed by an integer λ∈[0, . . . , λ_(max)]. Applying a timestep includes applying the matrix w^(λ), which is made of all the BS gates aligned vertically at this timestep (w^(λ) is the unitary in the unary basis). When a timestep is applied, the resulting state is a vector in the unary basis called an inner layer and denoted by ζ^(λ). This evolution can be written as ζ^(λ+1)=w^(λ)·ζ^(λ). We use this notation by analogy with the real layer ℓ, with the weight matrix W^(ℓ) and the resulting vector z^(ℓ) (see Section 8).

In fact, we have the correspondences ζ⁰=a^(ℓ−1) for the first inner layer, which is the input of the actual layer, and z^(ℓ)=w^(λmax)·ζ^(λmax) for the last timestep. We also have W^(ℓ)=w^(λmax) . . . w¹w⁰. We use the same kind of notation for the backpropagation errors. At each timestep λ we define an inner error

$\delta^{\lambda} = {\frac{\partial C}{\partial\zeta^{\lambda}}.}$

This definition is similar to the layer error

$\Delta^{\ell} = {\frac{\partial C}{\partial z^{\ell}}.}$

In fact, the same backpropagation formulas may be used, without nonlinearities, to retrieve each inner error vector δ^(λ)=(w^(λ))^(T)·δ^(λ+1). In particular, for the last timestep, the first to be calculated, we have δ^(λmax)=(w^(λmax))^(T)·Δ^(ℓ). Finally, we can retrieve the error at the previous layer ℓ−1 using the correspondence Δ^(ℓ−1)=δ⁰⊙σ′(z^(ℓ−1)), where ⊙ symbolizes the Hadamard product, or entry-wise multiplication.

The reason for this breakdown into timesteps is the ability to efficiently obtain the gradient with respect to each angle. Let's consider one gate with angle θ_(i), acting at the timestep λ on qubits i and i+1. We decompose the gradient $\frac{\partial C}{\partial\theta_{i}}$ using each component, indexed by the integer k, of the inner layer and inner error vectors:

$\begin{matrix}{\frac{\partial C}{\partial\theta_{i}} = {{\sum\limits_{k}{\frac{\partial C}{\partial\zeta_{k}^{\lambda + 1}}\frac{\partial\zeta_{k}^{\lambda + 1}}{\partial\theta_{i}}}} = {\sum\limits_{k}{\delta_{k}^{\lambda + 1}\frac{\partial\left( {w_{k}^{\lambda} \cdot \zeta^{\lambda}} \right)}{\partial\theta_{i}}}}}} & (8)\end{matrix}$

Since timestep λ is only composed of separate BS gates, the matrix w^(λ) consists of 2×2 block submatrices, each given by Eq. (1), arranged along the diagonal. Only one of these submatrices depends on the angle θ_(i) considered here, at positions i and i+1 in the matrix. We can thus rewrite the above gradient as:

$\begin{matrix}{\frac{\partial C}{\partial\theta_{i}} = {{\delta_{i}^{\lambda + 1}\left( {{{- {\sin\left( \theta_{i} \right)}}\zeta_{i}^{\lambda}} + {{\cos\left( \theta_{i} \right)}\zeta_{i + 1}^{\lambda}}} \right)} + {\delta_{i + 1}^{\lambda + 1}\left( {{{- {\cos\left( \theta_{i} \right)}}\zeta_{i}^{\lambda}} - {{\sin\left( \theta_{i} \right)}\zeta_{i + 1}^{\lambda}}} \right)}}} & (9)\end{matrix}$

Therefore, we have shown a way to compute each angle gradient: during the feedforward pass, sequentially apply each of the 2n−3=O(n) timesteps and store the resulting vectors (the inner layers ζ^(λ)). During the backpropagation, obtain the inner errors δ^(λ) by applying the timesteps in reverse. To do this, we “back-propagate” the errors by calculating first δ^(λ+1) and then δ^(λ), from λ_(max) down to 0. Afterwards, a gradient descent method may be used on each angle θ_(i), while preserving the orthogonality of the overall equivalent weight matrix:

$\left. \theta_{i}\leftarrow{\theta_{i} - {\eta{\frac{\partial C}{\partial\theta_{i}}.}}} \right.$
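The forward and backward passes over timesteps can be illustrated with a short classical sketch. It is only an assumption-laden example: the helper names (`pyramid_schedule`, `forward`, `angle_gradients`, `cost`), the pyramid gate layout, the squared-error cost and the learning rate 0.1 are all hypothetical choices made here. The sketch stores the inner layers ζ^(λ), applies Eq. (9) to each angle, propagates δ^(λ)=(w^(λ))^(T)·δ^(λ+1), and compares the result with a finite-difference estimate.

```python
import numpy as np

def pyramid_schedule(n):
    # Assumed layout: pair (i, i+1) is active at timesteps i, i+2, ..., 2n-4-i.
    return [[i for i in range(n - 1)
             if i <= lam <= 2 * n - 4 - i and (lam - i) % 2 == 0]
            for lam in range(2 * n - 3)]

def forward(x, thetas, schedule):
    """Return the inner layers zeta^0, ..., zeta^(lambda_max + 1)."""
    zetas = [np.array(x, dtype=float)]
    k = 0
    for step in schedule:
        z = zetas[-1].copy()
        for i in step:
            c, s = np.cos(thetas[k]), np.sin(thetas[k])
            zi, zj = z[i], z[i + 1]
            z[i], z[i + 1] = c * zi + s * zj, -s * zi + c * zj
            k += 1
        zetas.append(z)
    return zetas

def angle_gradients(zetas, thetas, schedule, delta_out):
    """Backward loop over timesteps: Eq. (9) per angle, then the inner error."""
    grads = np.zeros_like(thetas)
    delta = np.array(delta_out, dtype=float)      # delta^(lambda + 1)
    k = len(thetas)
    for lam in reversed(range(len(schedule))):
        step = schedule[lam]
        k -= len(step)                            # index of the first angle of this timestep
        prev = delta.copy()                       # will become delta^lambda
        for offset, i in enumerate(step):
            c, s = np.cos(thetas[k + offset]), np.sin(thetas[k + offset])
            zi, zj = zetas[lam][i], zetas[lam][i + 1]
            grads[k + offset] = (delta[i] * (-s * zi + c * zj)
                                 + delta[i + 1] * (-c * zi - s * zj))
            # transpose of the block [[c, s], [-s, c]] applied to the error
            prev[i] = c * delta[i] - s * delta[i + 1]
            prev[i + 1] = s * delta[i] + c * delta[i + 1]
        delta = prev
    return grads, delta

def cost(thetas, x, t, schedule):
    # Hypothetical cost: half the squared distance to a target vector t.
    return 0.5 * np.sum((forward(x, thetas, schedule)[-1] - t) ** 2)

n = 4
schedule = pyramid_schedule(n)
rng = np.random.default_rng(1)
thetas = rng.uniform(-np.pi, np.pi, size=n * (n - 1) // 2)
x, t = rng.normal(size=n), rng.normal(size=n)

zetas = forward(x, thetas, schedule)
grads, delta0 = angle_gradients(zetas, thetas, schedule, zetas[-1] - t)

# Central finite differences as an independent check of Eq. (9).
eps = 1e-6
numeric = np.array([(cost(thetas + eps * e, x, t, schedule)
                     - cost(thetas - eps * e, x, t, schedule)) / (2 * eps)
                    for e in np.eye(len(thetas))])

new_thetas = thetas - 0.1 * grads   # one gradient step on the angles
```

Because the update acts on the angles only, the equivalent weight matrix remains a product of rotations after the step, so no re-orthogonalization is needed.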

An interesting aspect of this gradient descent is the fact that the optimization is performed in the angle landscape, and not in the equivalent weight landscape. These landscapes can potentially be different and hence the optimization can produce different models.

As one can see from the above description, this is a classical algorithm to obtain the angles' gradients, which allows the OrthoNN to be trained efficiently classically while preserving strict orthogonality.

To obtain the angles' gradients, 2n−3 inner layers ζ^(λ) may be stored during the feedforward pass. Next, given the error at the following layer, a backward loop on each timestep may be performed (see FIG. 9). At each timestep, the gradient for each angle parameter may be determined by applying Eq. (9). This uses O(1) operations for each angle. Since there are at most n/2 angles per timestep for a pyramid circuit, and 2n−3 timesteps, estimating the gradients has a complexity of O(n²). After each timestep, the next inner error δ^(λ−1) is computed as well, using at most 4n/2 operations.

Thus, this classical algorithm allows the gradients of the n(n−1)/2 angles to be computed in O(n²), in order to perform a gradient descent respecting the strict orthogonality of the weight matrix. This is considerably faster than previous methods based on Singular Value Decomposition and provides a training method which is as fast as for normal neural networks (e.g., see Table 1), while providing the extra property of orthogonality.

5. Numerical Experiments

We performed basic numerical experiments to verify the learning abilities of the pyramidal circuit using a classical simulation. In these experiments, we use a dataset of handwritten digits (in this case, the standard MNIST dataset) to compare our pyramidal OrthoNN with an SVB algorithm.

FIG. 10 is a graph of the numerical experiments. The graph compares a standard neural network 1005 with an approximated OrthoNN 1010 (using an SVB algorithm) and our pyramidal OrthoNN 1015. The graph shows that the training of the classical NN 1005, the classical almost-orthogonal NN 1010, and the pyramidal orthogonal NN 1015 all lead to a good and equivalent learning accuracy. The x-axis is the number of epochs (i.e., the number of training steps). The y-axis is the accuracy. The unbroken lines are the accuracy on the “Train set,” which is the set of data used to update the weights/parameters. The dashed lines are the accuracy on the “Test set,” which is the set of data that were not seen during the training. The Test set data verifies that the model is not ‘overfitting’ to the training data. The reason the pyramid OrthoNN plot 1015 converges a bit later is unknown, but this does not seem to be a fundamental characteristic. Overall, the plots in FIG. 10 show that our proposal appears to be valid in at least small-scale classical simulations.

6. Significance

This disclosure describes training methods for orthogonal neural networks (OrthoNNs) that run in quadratic time, which is a significant improvement over previous methods based on Singular Value Decomposition.

One idea of our methods is to replace the usual weights and orthogonal matrices by an equivalent pyramidal circuit made of two-dimensional rotations. Each rotation is parametrizable by an angle, and the gradient descent takes place in the angles' optimization landscape. This unique type of gradient backpropagation may ensure a perfect orthogonality of the weight matrices while improving the running time compared to previous works. Moreover, both classical and quantum methods may be used for inference, where the forward pass on a near-term quantum computing system may provide a provable advantage in the running time. This disclosure also expands the field of quantum deep learning by introducing new tools, concepts, and equivalences with classical deep learning theory.

7. Description of Orthogonal Neural Networks

The idea behind Orthogonal Neural Networks (OrthoNNs) is to add a constraint to the weight matrices corresponding to the layers of a neural network. Imposing orthogonality on these matrices has theoretical and practical benefits for the generalization error. Orthogonality may ensure a low weight redundancy and may preserve the magnitude of the weight matrix's eigenvalues to avoid vanishing gradients. In terms of complexity, for a single layer, the feedforward pass of an OrthoNN is a matrix multiplication, hence has a running time of O(n²) if n×n is the size of the orthogonal matrix.

A difficulty of OrthoNNs is to preserve the orthogonality of the matrices while updating them during gradient descent. Several algorithms have been proposed to this end, but they all point out that pure orthogonality is computationally hard to conserve.

As used herein, an orthogonal matrix refers to a real square matrix whose columns and rows are orthonormal vectors. One way to express this is Q^(T)Q=QQ^(T)=I, where Q^(T) is the transpose of Q and I is the identity matrix.
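The definition, and the difficulty of preserving it under an unconstrained update, can be illustrated numerically. The sketch below is a hedged example: `is_orthogonal` and `nearest_orthogonal` are hypothetical helper names, and the random perturbation merely stands in for a gradient step. The re-orthogonalization sets all singular values to 1 through an SVD, which is the kind of step the angle-based training avoids.

```python
import numpy as np

def is_orthogonal(Q, tol=1e-10):
    """Check Q^T Q = I for a square matrix (this implies Q Q^T = I as well)."""
    Q = np.asarray(Q, dtype=float)
    return Q.shape[0] == Q.shape[1] and np.allclose(Q.T @ Q, np.eye(Q.shape[0]), atol=tol)

def nearest_orthogonal(M):
    """Replace the singular values of M by 1, i.e. return U V^T from M = U S V^T."""
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

theta = 0.3
Q = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])   # a planar rotation is orthogonal

rng = np.random.default_rng(0)
Q_step = Q - 0.1 * rng.normal(size=(2, 2))        # stand-in for an unconstrained gradient step
Q_fixed = nearest_orthogonal(Q_step)              # SVD-based repair, O(n^3) in general
```

In this example `Q` and `Q_fixed` satisfy the definition while `Q_step` does not, which is why methods that update the matrix entries directly need a repair step at each iteration.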

8. General Description of Backpropagation for Neural Networks

Backpropagation in a fully connected neural network is an efficient procedure to update the weight matrix at each layer. At layer ℓ, we denote its weight matrix by W^(ℓ) and its biases by b^(ℓ). Each layer is followed by a nonlinear function σ, and can therefore be written as

$\begin{matrix}{a^{\ell} = \sigma\left(W^{\ell} \cdot a^{\ell - 1} + b^{\ell}\right) = \sigma\left(z^{\ell}\right)} & (10)\end{matrix}$

After the last layer, one can define a cost function C that compares the output to the ground truth. The goal is to calculate the gradient of C with respect to each weight and bias, namely

$\frac{\partial C}{\partial W^{\ell}}\text{ and }\frac{\partial C}{\partial b^{\ell}}.$

In backpropagation, the method calculates these gradients for the last layer, then propagates back to the first layer.

The error vector at layer ℓ may be defined by

$\Delta^{\ell} = {\frac{\partial C}{\partial z^{\ell}}.}$

One can show the backward recursive relation Δ^(ℓ)=(W^(ℓ+1))^(T)·Δ^(ℓ+1)⊙σ′(z^(ℓ)), where ⊙ symbolizes the Hadamard product, or entry-wise multiplication. Note that the previous computation applies the layer (i.e., the matrix multiplication) in reverse. We can then show that each element of the weight gradient matrix at layer ℓ is given by

$\frac{\partial C}{\partial W_{jk}^{\ell}} = \Delta_{j}^{\ell} \cdot a_{k}^{\ell - 1}.$

Similarly, the gradient with respect to the biases is given by

$\frac{\partial C}{\partial b_{j}^{\ell}} = \Delta_{j}^{\ell}.$

Once these gradients are computed, the parameters may be updated using the gradient descent rule, with learning rate η (note that η may be the same as or different from the η used in Section 4):

$\begin{matrix}{W^{\ell}\leftarrow W^{\ell} - \eta\frac{\partial C}{\partial W^{\ell}};\quad b^{\ell}\leftarrow b^{\ell} - \eta\frac{\partial C}{\partial b^{\ell}}} & (11)\end{matrix}$
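The standard backpropagation recursion summarized above can be sketched for a small fully connected network. The layer sizes, the tanh activation, the squared-error cost and all function names below are assumptions made for illustration only; the finite-difference comparison checks one entry of the weight gradient.

```python
import numpy as np

def sigma(z):
    return np.tanh(z)

def sigma_prime(z):
    return 1.0 - np.tanh(z) ** 2

def forward(Ws, bs, x):
    """Return the pre-activations z^l and the activations a^l (acts[0] is the input)."""
    a, zs, acts = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigma(z)
        zs.append(z)
        acts.append(a)
    return zs, acts

def cost(Ws, bs, x, target):
    return 0.5 * np.sum((forward(Ws, bs, x)[1][-1] - target) ** 2)

def backprop(Ws, bs, x, target):
    zs, acts = forward(Ws, bs, x)
    Delta = (acts[-1] - target) * sigma_prime(zs[-1])    # error at the last layer
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        grads_W[l] = np.outer(Delta, acts[l])            # dC/dW_jk = Delta_j * a_k (previous layer)
        grads_b[l] = Delta                               # dC/db_j = Delta_j
        if l > 0:
            # backward recursion: apply the layer in reverse, then the activation derivative
            Delta = (Ws[l].T @ Delta) * sigma_prime(zs[l - 1])
    return grads_W, grads_b

rng = np.random.default_rng(0)
sizes = [3, 4, 2]
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=m) for m in sizes[1:]]
x, target = rng.normal(size=3), rng.normal(size=2)

gW, gb = backprop(Ws, bs, x, target)

# Finite-difference check on a single weight of the first layer.
eps = 1e-6
Wp = [W.copy() for W in Ws]
Wm = [W.copy() for W in Ws]
Wp[0][1, 2] += eps
Wm[0][1, 2] -= eps
numeric = (cost(Wp, bs, x, target) - cost(Wm, bs, x, target)) / (2 * eps)

eta = 0.05
Ws_new = [W - eta * g for W, g in zip(Ws, gW)]           # update rule of Eq. (11)
bs_new = [b - eta * g for b, g in zip(bs, gb)]
```

Note that `Ws_new` is in general no longer orthogonal even if `Ws` was, which is the motivation for performing the descent on angles as in Section 4.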

9. Preliminaries in Quantum Computing

This section provides a succinct quantum information background that may be helpful for this work.

9.1 Qubits

In classical computing, a bit can be either 0 or 1. From a quantum information perspective, a quantum bit or qubit can be in state |0⟩ or |1⟩. We use the bra-ket notation |⋅⟩ to specify the quantum nature of the bit. A qubit can be in a superposition of both states α|0⟩+β|1⟩, where α, β∈ℂ are such that |α|²+|β|²=1. The coefficients α and β are called amplitudes. The probabilities of observing either 0 or 1 when measuring the qubit are linked to the amplitudes:

p(0)=|α|², p(1)=|β|²  (12)

As quantum physics teaches us, any superposition is possible before the measurement, which gives special abilities in terms of computation. With n qubits, 2^(n) possible binary combinations (e.g. |01 . . . 1001⟩) can exist simultaneously, each with its own amplitude.

An n-qubit system can be represented as a normalized vector in a 2^(n)-dimensional Hilbert space. A multiple-qubit system is called a quantum register. If |p⟩ and |q⟩ are two quantum states or quantum registers, the whole system can be represented as a tensor product |p⟩⊗|q⟩, also written as |p⟩|q⟩ or |p, q⟩.
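As a small numerical illustration of amplitudes, measurement probabilities and tensor products, the following sketch represents states as NumPy vectors. The specific amplitudes 0.6 and 0.8i are arbitrary values chosen here, and the tensor product is computed with `np.kron`.

```python
import numpy as np

ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])

# A single-qubit superposition alpha|0> + beta|1> with |alpha|^2 + |beta|^2 = 1.
alpha, beta = 0.6, 0.8j
psi = alpha * ket0 + beta * ket1

# Measurement probabilities of Eq. (12).
probs = np.abs(psi) ** 2

# Two-qubit register |psi>|1> as a tensor product: a vector of length 2^2 = 4.
register = np.kron(psi, ket1)

# An n-qubit register lives in a space of dimension 2^n.
n = 3
all_zero = np.zeros(2 ** n)
all_zero[0] = 1.0          # the state |000>
```

The vector `register` stays normalized because the tensor product of two normalized states is normalized.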

9.2 Quantum Computation

Like logic gates in classical circuits, quantum gates process qubits or quantum registers. A gate is a unitary mapping in the Hilbert space, preserving the unit norm of the quantum state vector. Therefore, a quantum gate acting on n qubits is a matrix U∈ℂ^(2^n × 2^n) such that UU^(†)=U^(†)U=I, with U^(†) being the adjoint, or conjugate transpose, of U.

Common single-qubit gates include the Hadamard gate

$\frac{1}{\sqrt{2}}\begin{pmatrix}1 & 1 \\1 & {- 1}\end{pmatrix}$

that maps

$|0\rangle \mapsto \frac{1}{\sqrt{2}}\left(|0\rangle + |1\rangle\right)\text{ and }|1\rangle \mapsto \frac{1}{\sqrt{2}}\left(|0\rangle - |1\rangle\right),$

creating the quantum superposition, the NOT gate

$\begin{pmatrix}0 & 1 \\1 & 0\end{pmatrix}$

that permutes |0⟩ and |1⟩, and the R_(y) rotation gate parametrized by an angle θ, given by

$\begin{pmatrix}{\cos\left( {\theta/2} \right)} & {- \sin\left( {\theta/2} \right)} \\{\sin\left( {\theta/2} \right)} & {\cos\left( {\theta/2} \right)}\end{pmatrix}.$

Common two-qubit gates include the CNOT gate

$\begin{pmatrix}1 & 0 & 0 & 0 \\0 & 1 & 0 & 0 \\0 & 0 & 0 & 1 \\0 & 0 & 1 & 0\end{pmatrix}$

which is a NOT gate applied on the second qubit only if the first one is in state |1⟩, and similarly the CZ gate

$\begin{pmatrix}1 & 0 & 0 & 0 \\0 & 1 & 0 & 0 \\0 & 0 & 1 & 0 \\0 & 0 & 0 & {- 1}\end{pmatrix}.$

In this work, we use the BS gate. In some embodiments, this gate can be implemented either as a native gate, known as FSIM, or using four Hadamard gates, two R_(y) rotation gates, and two two-qubit CZ gates. An example of this circuit is illustrated in FIG. 11.

One advantage of quantum gates is their ability to be applied to a superposition of inputs. Indeed, given a gate U such that U|x⟩ ↦ |f(x)⟩, it can be applied to all possible combinations of x at once

$U\left( \frac{1}{c}\sum_{x}|x\rangle \right) \mapsto \frac{1}{c}\sum_{x}|f(x)\rangle.$
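The gates listed in this section can be written as small matrices and checked for unitarity. The sketch below is illustrative only; `is_unitary` and `ry` are hypothetical helper names, and the last lines show a gate acting on a superposition of two basis states at once.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)          # Hadamard
NOT = np.array([[0, 1], [1, 0]])                      # NOT (Pauli X)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])
CZ = np.diag([1, 1, 1, -1])

def ry(theta):
    """R_y rotation gate parametrized by the angle theta."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def is_unitary(U):
    return np.allclose(U.conj().T @ U, np.eye(U.shape[0]))

ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])

plus = H @ ket0                                       # (|0> + |1>) / sqrt(2)
flipped = CNOT @ np.kron(ket1, ket0)                  # |10> is mapped to |11>

# Linearity: one application of CNOT acts on both terms of (|00> + |10>) / sqrt(2).
superposed = CNOT @ np.kron(plus, ket0)
expected = (np.kron(ket0, ket0) + np.kron(ket1, ket1)) / np.sqrt(2)
```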

10. Additional Details on the Quantum Pyramidal Circuit:

10.1 Tomography and Error Mitigation

As shown in FIGS. 7A-7C, when using the quantum circuit, the output is a quantum state |y⟩=|Wx⟩. As often in quantum machine learning, it may be important to go all the way and consider the cost of retrieving classical outputs, using a procedure called tomography. In our case, this may be important since between each layer, the quantum output is converted into a classical one in order to apply a nonlinear function, and then reloaded for the next layer.

10.1.1 Error Mitigation

Before detailing the tomography procedure, it is interesting to notice that with our restriction to unary states, a strong benefit appears for error mitigation purposes. Indeed, since we may expect to obtain only quantum superpositions of unary states at every layer, measurements may be post-processed to discard outcomes that are non-unary states (i.e., states with more than one qubit in state |1⟩, or the ground state). The most expected error is a bit flip between |1⟩ and |0⟩. The case where two bit flips happened, which would pass through our error mitigation, is even less probable.

10.1.2 Tomography

Retrieving the amplitudes of a quantum state comes at the cost of multiple measurements, which requires running the circuit multiple times, hence adding a multiplicative overhead in the running time. A finite number of samples is also a source of approximation in the final result. In this work, we allow for ℓ_(∞) errors. The ℓ_(∞) tomography on a quantum state |y⟩ with unary encoding on n qubits may require O(log(n)/δ²) measurements, where δ>0 is the error threshold allowed. For each j∈[n], |y_(j)| is obtained with an absolute error δ, and if |y_(j)|<δ, it will most probably not be measured, hence set to 0. In practice, one would perform as many measurements as is convenient during the experiment and deduce the equivalent precision δ from the number of measurements made.
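A classical simulation gives a feel for this sampling procedure. In the sketch below, the function `estimate_magnitudes`, the example vector `y`, and the constant 36 in the number of shots are assumptions chosen for illustration; only the log(n)/δ² scaling comes from the text. Measurements of the unary state are simulated by drawing from the distribution p(e_j)=y_j².

```python
import math
import numpy as np

def estimate_magnitudes(y, shots, rng):
    """Simulate `shots` measurements of a unary-encoded state and estimate |y_j|."""
    p = np.asarray(y, dtype=float) ** 2
    p = p / p.sum()                          # measurement distribution over unary states
    counts = rng.multinomial(shots, p)
    return np.sqrt(counts / shots)           # signs are lost, only magnitudes remain

n, delta = 4, 0.05
shots = math.ceil(36 * math.log(n) / delta ** 2)   # hypothetical constant in front of log(n)/delta^2

rng = np.random.default_rng(0)
y = np.array([0.7, -0.5, 0.5, 0.1])                # normalized example vector
estimate = estimate_magnitudes(y, shots, rng)
max_error = np.max(np.abs(estimate - np.abs(y)))   # l_infinity error of the estimate
```

The estimate recovers only |y_j|, which is why a separate sign retrieval procedure is described next.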

In some embodiments, the sign of each component of the vector may be determined. Indeed, since we measure probabilities that are the squared moduli of the quantum amplitudes, the sign may not be readily apparent. In the case of neural networks, it may be important to obtain the sign of the layer's components in order to apply certain types of nonlinearities. For instance, the ReLU activation function is often used to set all negative components to 0.

In FIGS. 12A-12C, we propose specific enhancements (e.g., additional circuits) to our pyramid circuit to obtain the signs of the vector's components at low cost. Specifically, FIGS. 12A-12C illustrate tomography procedures to retrieve the value and the sign of each component of the resulting vector |y⟩=|Wx⟩. FIG. 12A illustrates the pyramid circuit previously described. FIGS. 12B and 12C illustrate the pyramid circuit with additional circuits. Specifically, FIGS. 12B and 12C include additional BS gates with angle π/4 to compare the signs between adjacent components. In all three cases an ℓ_(∞) tomography may be applied. ℓ_(∞) tomography is a method that determines how many samples from a quantum state to take to retrieve an approximated description of it. The approximation is made with a relative error with respect to the ‘infinite norm’ (instead of the usual ‘L2’ or Euclidean norm).

The sign retrieval procedure may include three parts.

The circuit is first applied as described above (e.g., execute the circuit in FIG. 12A), allowing each squared amplitude y_(j)² to be retrieved with precision δ>0 using the ℓ_(∞) tomography. The probability of measuring the unary state |e₁⟩ (i.e., |100 . . .⟩) is p(e₁)=y₁².

The same steps are applied a second time on a modified circuit (e.g., execute the circuit in FIG. 12B). It has additional BS gates with angle π/4 at the end (e.g., sign circuit 1205A), which mix the amplitudes pair by pair. The probabilities to measure |e₁⟩ and |e₂⟩ are now given by p(e₁)=(y₁+y₂)² and p(e₂)=(y₁−y₂)². Since p(e₁)−p(e₂)=4y₁y₂, if p(e₁)>p(e₂), we have sign(y₁)=sign(y₂), and if p(e₁)<p(e₂), we have sign(y₁)≠sign(y₂). The same holds for the pairs (y₃,y₄), and so on.

The same steps are applied again, except the additional BS gates are shifted by one position below (e.g., execute the circuit in FIG. 12C, including sign circuit 1205B). Then the signs of the pairs are compared: (y₂,y₃), (y₄,y₅), and so on.

Each value y_(j) with its sign may then be determined, e.g., assuming that y₁>0. This procedure has the benefit of only adding a constant depth (in other words, it does not grow with the number of qubits). In this case, the depth increases by one. However, this process may use three times more runs. The overall cost of the tomography procedure with sign retrieval is given by Õ(n/δ²).
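The three-part procedure can be sketched classically with exact probabilities in place of sampled ones. The names `mix_pairs` and `recover_with_signs` are hypothetical, and the mixing matrix (1/√2)[[1, 1], [1, −1]] is an assumption standing in for the BS gates with angle π/4; it reproduces the stated probabilities (y₁+y₂)² and (y₁−y₂)² up to a common factor 1/2. With those probabilities, p(e₁)−p(e₂) is proportional to y₁y₂, so the larger first probability indicates equal signs.

```python
import numpy as np

def mix_pairs(y, start):
    """Mix the pairs (start, start+1), (start+2, start+3), ... with the assumed pi/4 gate."""
    out = np.array(y, dtype=float)
    for i in range(start, len(y) - 1, 2):
        out[i] = (y[i] + y[i + 1]) / np.sqrt(2)
        out[i + 1] = (y[i] - y[i + 1]) / np.sqrt(2)
    return out

def recover_with_signs(p_plain, p_even, p_odd):
    """Combine the three sets of probabilities, assuming the first component is positive."""
    n = len(p_plain)
    magnitudes = np.sqrt(p_plain)
    signs = np.ones(n)
    for i in range(n - 1):
        p = p_even if i % 2 == 0 else p_odd        # which run compared the pair (i, i+1)
        same_sign = p[i] > p[i + 1]                # (y_i + y_{i+1})^2 > (y_i - y_{i+1})^2
        signs[i + 1] = signs[i] if same_sign else -signs[i]
    return signs * magnitudes

y = np.array([0.7, -0.5, 0.5, 0.1])                # example vector with y_1 > 0

p_plain = y ** 2                                   # first run: plain circuit
p_even = mix_pairs(y, 0) ** 2                      # second run: pairs (y1, y2), (y3, y4)
p_odd = mix_pairs(y, 1) ** 2                       # third run: pairs (y2, y3)

recovered = recover_with_signs(p_plain, p_even, p_odd)
```

In practice each set of probabilities would come from a finite number of measurements, so pairs with amplitudes close to zero can give unreliable sign comparisons.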

In FIG. 13 we propose another method to obtain the values of the amplitudes and the signs of each component of the resulting vector |y⟩=|Wx⟩. Compared to the above procedure, it executes a single circuit, but it may require an additional qubit, and the depth of the circuit may be 3n+O(1) instead of 2n+O(1). This circuit initializes the qubits in (|0⟩+|1⟩)|0⟩, where the last |0⟩ corresponds to the n qubits that will be processed by the pyramidal circuit and the loaders. Next, applying the data loader for the normalized input vector x and then the pyramidal circuit maps the state, according to Eq. (6), to:

$\begin{matrix}{|0\rangle|0\rangle + |1\rangle\sum\limits_{j = 1}^{n}W_{j}x|e_{j}\rangle} & (13)\end{matrix}$

Then, we use an additional data loader for the uniform norm-1 vector

$\left( {\frac{1}{\sqrt{n}},\ldots,\frac{1}{\sqrt{n}}} \right).$

Note that this loader is built in the reverse order to fit the pyramid and limit the increase of the depth. We also apply the adjoint of this loader after a controlled operation on the first extra qubit. Recall that if a circuit U is followed by U^(†), it is equivalent to the identity. Therefore, this loads the uniform state only in one part of the superposition of the extra qubit:

$\begin{matrix}{|0\rangle\sum\limits_{j = 1}^{n}\frac{1}{\sqrt{n}}|e_{j}\rangle + |1\rangle\sum\limits_{j = 1}^{n}W_{j}x|e_{j}\rangle} & (14)\end{matrix}$

Afterwards, a Hadamard gate mixes both parts of the amplitudes on the extra qubit:

$\begin{matrix}{|0\rangle\sum\limits_{j = 1}^{n}\left( \frac{1}{\sqrt{n}} + W_{j}x \right)|e_{j}\rangle + |1\rangle\sum\limits_{j = 1}^{n}\left( \frac{1}{\sqrt{n}} - W_{j}x \right)|e_{j}\rangle} & (15)\end{matrix}$

On this state, we can see that the probability of measuring the extra qubit in state 0 and the rest in the unary state e_(j) is given by

${p\left( {0,e_{j}} \right)} = {\left( {\frac{1}{\sqrt{n}} + {W_{j}x}} \right)^{2}.}$

Therefore, for each j, if after several measurements we observe

${{p\left( {0,e_{j}} \right)} > \frac{1}{n}},$

we can deduce W_(j)x>0. Having the sign, we can get the value

${W_{j}x} = {{\pm \sqrt{p\left( {0,e_{j}} \right)}} - {\frac{1}{\sqrt{n}}.}}$

Combining with the ℓ_(∞) tomography and the nonlinearity, the overall cost of this tomography is given by Õ(n/δ²) as well.

10.2 Multiple Quantum Layers

In the previous sections, we have seen how to implement a quantum circuit to perform the evolution of one orthogonal layer. In classical deep learning, such layers are stacked to gain in expressivity and accuracy. Between each layer, a nonlinear function may be applied to the resulting vector.

The benefit of using our quantum pyramidal circuits is the ability to simply concatenate them to mimic a multi-layer neural network. After each layer, a tomography of the output state |z⟩ is performed to retrieve each component, corresponding to its quantum amplitudes. A nonlinear function σ is then applied classically to obtain a=σ(z). The next layer starts with a new unary data loader. This scheme allows us to keep the depth of the quantum circuits reasonable for NISQ devices, by applying the neural network layer by layer.

FIGS. 14A and 14B each illustrate an example neural network with layers. FIG. 14A illustrates a classical representation of a neural network with three layers. The layers have [8,8,4,4] nodes. FIG. 14B illustrates the equivalent neural network using a quantum circuit. The circuit includes a concatenation of multiple pyramidal circuits (gates in the pyramid circuits have parameter θ_(i)). Between each layer a measurement operation is performed and a nonlinearity is applied. Additionally, each layer starts with a new unary data loader (gates in the data loader circuits have parameter α_(i)).

In some embodiments, the circuit can include additional entangling gates after each pyramid layer (composed for instance of CNOT or CZ gates). This would mark a step out of the unary basis but may effectively allow more interactions in the Hilbert space to be explored.

11. Example Circuits

This section describes example quantum circuits with different architectures that can be used to implement a layer of an orthogonal neural network. Their details are summarized below in Table 2:

TABLE 2: Example circuits to implement a layer of an orthogonal neural network

| Example Name | FIG. | No. of Gates | Depth | Connectivity |
| --- | --- | --- | --- | --- |
| Pyramid | FIG. 18A | n(n − 1)/2 | 2n − 3 | Nearest Neighbors |
| Butterfly | FIG. 18B | n log(n)/2 | log(n) | All-to-All Connectivity |
| Brick | FIG. 18C | n(n − 1)/2 | n | Nearest Neighbors |
| V | FIG. 18D | 2n − 3 | 2n − 3 | Nearest Neighbors |
| X | FIG. 18E | 2n − 3 | n − 1 | Nearest Neighbors |

Descriptions of the circuits listed in Table 2 are provided below.

The pyramid circuit is described in other sections. An example pyramid circuit is illustrated in FIG. 18A.

The butterfly circuit was inspired by the butterfly circuits of the Cooley-Tukey FFT algorithm. The butterfly circuit described herein is an efficient way to characterize a reduced yet powerful class of orthogonal layers. This circuit is a low-depth circuit as compared to others (log(n) depth). The butterfly layer does not characterize all the orthogonal matrices (with determinant 1) due to its reduced number of parameters (n log(n)/2) but still covers a class of orthogonal matrices, like the unary Fourier Transform. This circuit may require all-to-all qubit connectivity. A parallel data loader may be preferred with this circuit.

An example butterfly circuit is illustrated in FIG. 18B. The vertical lines in FIG. 18B represent BS gates. The layers of the butterfly circuit in FIG. 18B may be described as follows. The first layer applies a BS gate to the first qubit and the fifth qubit. The second layer applies a BS gate to the second qubit and the sixth qubit. The third layer applies a BS gate to the third qubit and the seventh qubit. The fourth layer applies a BS gate to the fourth qubit and the eighth qubit (the eighth qubit is the last qubit in this example). The fifth layer applies a BS gate to the first qubit and the third qubit and a BS gate to the fifth qubit and the seventh qubit. The sixth layer applies a BS gate to the second qubit and the fourth qubit and a BS gate to the sixth qubit and the eighth qubit. The seventh layer includes: a BS gate applied to the first qubit and the second qubit, a BS gate applied to the third qubit and the fourth qubit, a BS gate applied to the fifth qubit and the sixth qubit, and a BS gate applied to the seventh qubit and the eighth qubit. However, other arrangements are possible depending on the positions of the qubits in the diagram.

The brick circuit is the most depth-efficient orthogonal layer circuit with BS gates in Table 2 that can characterize the entire class of orthogonal matrices with determinant 1. The brick circuit may have the same number of parameters as the pyramid circuit (n(n−1)/2) but about half the depth. Some embodiments of the brick circuit use nearest neighbor qubit connectivity. However, loading data using a data loader may add an additional depth (e.g., n/2 for a semi-diagonal loader or log(n) for a parallel loader). In many cases, the brick circuit may be preferred (e.g., optimal) due to its small depth.

An example brick circuit is illustrated in FIG. 18C. The layers of the brick circuit in FIG. 18C may be described as follows: the first layer applies a BS gate to qubits 1 and 2, a BS gate to qubits 3 and 4, a BS gate to qubits 5 and 6, and a BS gate to qubits 7 and 8. The second layer applies a BS gate to qubits 2 and 3, a BS gate to qubits 4 and 5, and a BS gate to qubits 6 and 7. The following layers have a similar arrangement as the first two layers. In FIG. 18C this results in the BS gates forming a rectangle shape. However, other shapes are possible depending on the positions of the qubits in the diagram.

The V circuit is a reduced version of the pyramid circuit. The V circuit is designed for NISQ hardware. This layer provides a path from every qubit to every qubit but has only a linear number of parameters (2n−3). It may be preferred to use a diagonal data loader with the V circuit.

An example V circuit is illustrated in FIG. 18D. The layers of the V circuit in FIG. 18D may be described as follows: the first layer applies a BS gate to qubits 1 and 2. The second layer applies a BS gate to qubits 2 and 3. This pattern continues until the seventh layer applies a BS gate to qubits 7 and 8 (qubit 8 is the last qubit in this example). The eighth layer applies a BS gate to qubits 6 and 7. The ninth layer applies a BS gate to qubits 5 and 6. This pattern continues until the last layer applies a BS gate to qubits 1 and 2. In FIG. 18D this results in the BS gates forming a V shape. However, other shapes are possible depending on the positions of the qubits in the diagram.
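For illustration only, the V pattern can be written as a descending cascade followed by an ascending one. The function name `v_gates` is hypothetical; the sketch simply reproduces the ordering described above and confirms the 2n−3 gate count.

```python
def v_gates(n):
    """Return the ordered BS-gate qubit pairs (1-based) of a V layout.

    One gate per layer: down from (1, 2) to (n-1, n), then back up to (1, 2).
    """
    down = [(q, q + 1) for q in range(1, n)]
    up = [(q, q + 1) for q in range(n - 2, 0, -1)]
    return down + up


if __name__ == "__main__":
    gates = v_gates(8)
    print(gates)
    print(len(gates) == 2 * 8 - 3)
```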

The X circuit (not to be confused with an X gate) is a reduced version of the brick circuit. The X circuit is designed for NISQ hardware. This layer provides a path from every qubit to every qubit but has only a linear number of parameters (2n−3). The additional loader depth may be the same as for the brick circuit.

An example X circuit is illustrated in FIG. 18E. The layers of the X circuit in FIG. 18E may be described as follows: the first layer applies a BS gate to qubits 1 and 2 and a BS gate to qubits 7 and 8 (qubit 8 is the last qubit in this example). The second layer applies a BS gate to qubits 2 and 3 and a BS gate to qubits 6 and 7. The third layer applies a BS gate to qubits 3 and 4 and a BS gate to qubits 5 and 6. The fourth layer applies a BS gate to qubits 4 and 5. The pattern continues in reverse order until the last layer applies a BS gate to qubits 1 and 2 and a BS gate to qubits 7 and 8. In FIG. 18E this results in the BS gates forming an X shape. However, other shapes are possible depending on the positions of the qubits in the diagram.
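The X pattern may be sketched in the same way. The function name `x_layers` is hypothetical, and the sketch assumes an even number of qubits so that the two arms meet at a single middle gate.

```python
def x_layers(n):
    """Return layers of BS-gate qubit pairs (1-based) in an X layout.

    Assumes n is even. Two arms move inward from the outer qubits, meet at
    the middle pair, and then move outward again.
    """
    assert n % 2 == 0 and n >= 2, "this sketch assumes an even n"
    half = n // 2
    inward = [[(k, k + 1), (n - k, n - k + 1)] for k in range(1, half)]
    middle = [[(half, half + 1)]]
    return inward + middle + inward[::-1]


if __name__ == "__main__":
    for layer in x_layers(8):
        print(layer)
```

For n = 8 the layers contain 2, 2, 2, 1, 2, 2, and 2 gates, for a total of 13 = 2n−3.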

The training methods for the above circuits may be the same as described in Section 4. For example, we go inner-layer-by-inner-layer and update each angle using the same update rule as described above for the pyramid layers.

As stated above, the above circuits provide a path from each qubit to every other qubit. For example, looking at one of the circuit diagrams in FIGS. 18A-18E, one can trace a set of paths from one qubit to any other qubit using the horizontal lines (wires) and the vertical lines (BS gates). Each set of paths from one qubit to another corresponds to one element of the orthogonal matrix (e.g., the matrix corresponding to the quantum circuit, restricted to the ‘unary’ basis). If a path from one qubit to another is not available, the matching element of the matrix is not tunable. This helps implement a layer of a fully connected neural network since each input node is connected to each output node in a fully connected neural network layer.
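The path property can be checked mechanically. The following sketch, with the hypothetical helper `reachable_outputs`, walks an ordered gate list and records which output wires each input wire can reach; the two gate lists are illustrative examples on four qubits, not taken from the figures.

```python
def reachable_outputs(n, gates):
    """For each input wire (1-based), return the set of output wires it can reach.

    `gates` is an ordered list of (a, b) qubit pairs. A path entering a gate
    on either wire can leave it on either wire.
    """
    reach = [{q} for q in range(1, n + 1)]
    for a, b in gates:
        for wires in reach:
            if a in wires or b in wires:
                wires.update((a, b))
    return reach


if __name__ == "__main__":
    v_circuit = [(1, 2), (2, 3), (3, 4), (2, 3), (1, 2)]  # V layout on 4 qubits
    cascade = [(1, 2), (2, 3), (3, 4)]                    # single diagonal only
    print(reachable_outputs(4, v_circuit))
    print(reachable_outputs(4, cascade))
```

In this example every input of the V layout reaches every output, while in the single cascade the last input reaches only wires 3 and 4, so the corresponding matrix elements would not be tunable.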

The above circuits may characterize the special orthogonal group, i.e., the orthogonal matrices with determinant +1. They may be generalized to incorporate the ones with determinant −1 as well by applying a Z gate at the end on the last qubit.
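A small numerical check of this statement, restricted to the unary basis, is sketched below. It assumes the planar-rotation form [[cos θ, sin θ], [−sin θ, cos θ]] for the action of a BS gate on two adjacent unary amplitudes; in the unary basis a Z gate on the last qubit acts as diag(1, ..., 1, −1). The helper name `bs_matrix` and the angles are hypothetical.

```python
import numpy as np


def bs_matrix(n, i, theta):
    """n x n unary-basis matrix of a BS gate on wires i and i+1 (0-based).

    Assumed convention: planar rotation [[c, s], [-s, c]].
    """
    w = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    w[i, i], w[i, i + 1] = c, s
    w[i + 1, i], w[i + 1, i + 1] = -s, c
    return w


# A product of BS gates is a rotation: orthogonal with determinant +1.
W = bs_matrix(3, 0, 0.4) @ bs_matrix(3, 1, -1.2) @ bs_matrix(3, 0, 0.9)

# Z on the last qubit flips the sign of the last unary basis state only.
Z_last = np.diag([1.0, 1.0, -1.0])
W_neg = Z_last @ W

if __name__ == "__main__":
    print(np.linalg.det(W), np.linalg.det(W_neg))
```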

11.1 Performing Unary Fourier Transform using Butterfly Circuits

Classically, the matrix that implements a Fourier transform (FFT) is given by:

$W = {\frac{1}{\sqrt{N}}\begin{bmatrix}1 & 1 & 1 & 1 & \cdots & 1 \\1 & \omega & \omega^{2} & \omega^{3} & \cdots & \omega^{N - 1} \\1 & \omega^{2} & \omega^{4} & \omega^{6} & \cdots & \omega^{2{({N - 1})}} \\1 & \omega^{3} & \omega^{6} & \omega^{9} & \cdots & \omega^{3{({N - 1})}} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\1 & \omega^{N - 1} & \omega^{2{({N - 1})}} & \omega^{3{({N - 1})}} & \cdots & \omega^{{({N - 1})}{({N - 1})}}\end{bmatrix}}$

where ω is a primitive N-th root of unity.
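As a classical point of reference, the matrix W can be built directly and compared with a library FFT. The sketch assumes the sign convention ω = e^(−2πi/N), which is the one NumPy uses; the text above does not fix the convention, and the function name `dft_matrix` is hypothetical.

```python
import numpy as np


def dft_matrix(N):
    """Return the N x N matrix W with entries omega^(j*k) / sqrt(N).

    Assumed convention: omega = exp(-2*pi*i / N).
    """
    omega = np.exp(-2j * np.pi / N)
    j, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return omega ** (j * k) / np.sqrt(N)


if __name__ == "__main__":
    N = 8
    W = dft_matrix(N)
    x = np.arange(N, dtype=float)
    # W is unitary and matches the orthonormal FFT of x.
    print(np.allclose(W @ W.conj().T, np.eye(N)))
    print(np.allclose(W @ x, np.fft.fft(x, norm="ortho")))
```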

A Fourier transform in the unary domain may be performed by using the butterfly circuit architecture with one additional single-qubit gate per BS gate. FIG. 19 is a diagram of a circuit for unary Fourier Transform using butterfly circuits. The circuit includes BS gates (vertical lines), all with fixed angles $\frac{\pi}{4}$. The circuit also uses another type of one-qubit rotation gate, represented by a white square with −ω^(k) in it, for which the matrix is given by [[1, 0], [0, −ω^(k)]], where ω is the corresponding root of unity. Thus, the circuit in FIG. 19 is a quantum circuit for which the unitary matrix (restricted to the unary basis) is exactly the FFT matrix.

12. Example Methods

12.1 Example Method of Executing a Quantum Circuit to Implement a Layer of a Neural Network.

FIG. 16 is a flowchart of an example method for executing a quantum circuit to implement a layer of a neural network. The layer of the neural network has n>0 input nodes, d>0 output nodes, and an orthogonal weight matrix. The method is described as being performed by a computing system that, for example, includes a classical computing system and quantum computing system. The computing system may perform the operations of the method by executing instructions stored on a non-transitory computer readable storage medium.

The computing system executes 1610 at least O(log(n)) layers of the quantum circuit that apply BS gates, each BS gate being a single parameterized two-qubit gate, the number of BS gates being equal to the number of degrees of freedom of the orthogonal weight matrix. In some embodiments, the BS gates are applied to x>0 qubits of a quantum computer. In some embodiments, execution of the at least O(log(n)) layers of the quantum circuit is performed by a classical computer simulating a quantum computer (e.g., see Section 3.1 for more information). In some embodiments, the layer of the neural network is fully connected. In some embodiments, n=d.

The number of BS gates in the O(log(n)) layers may be equal to (2n−1−d)*d/2 (e.g., see Section 2 for more information). In some embodiments, the O(log(n)) layers only include BS gates (e.g., see FIG. 2A and related descriptions). Additionally, or alternatively, the number of qubits x equals the number of input nodes n of the layer of the neural network (e.g., see FIGS. 2A-3B and related descriptions).
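The gate count above can be evaluated directly. In the sketch below the function name `num_bs_gates` is hypothetical; for n = d the expression reduces to n(n−1)/2, the parameter count quoted earlier for the pyramid and brick circuits.

```python
def num_bs_gates(n, d):
    """Number of BS gates, (2n - 1 - d) * d / 2, for n inputs and d outputs.

    The product (2n - 1 - d) * d is always even, so integer division is exact.
    """
    return (2 * n - 1 - d) * d // 2


if __name__ == "__main__":
    print(num_bs_gates(8, 8))  # square layer, n = d = 8
    print(num_bs_gates(8, 4))  # rectangular layer, n = 8, d = 4
```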

In some embodiments, each BS gate is applied to adjacent qubits of the quantum computer. Adjacent qubits may refer to nearest neighbor qubits on a qubit register of the quantum computer. Generally, a pair of non-adjacent qubits are qubits that are far enough apart, or with sufficiently many obstructing qubits or other components between them, that the mechanism used to couple the qubits in the physical platform they are implemented with does not work to implement a two-qubit interaction directly between the pair without some modification to the coupling procedure or hardware. Adjacent qubits on a qubit register may be adjacent to each other on a circuit diagram. For example, with respect to FIG. 2A, qubit 2 may be adjacent to qubits 1 and 3 on a qubit register.

In some embodiments, the at least O(log(n)) layers apply BS gates to the qubits according to a pyramid pattern (e.g., see FIG. 2A). In some embodiments, the pyramid pattern includes: a first layer including a first BS gate applied to a first qubit and a second qubit; a second layer including a second BS gate applied to the second qubit and a third qubit; a third layer including a third BS gate applied to the first qubit and the second qubit, and a fourth BS gate applied to the third qubit and a fourth qubit; and a fourth layer including a fifth BS gate applied to the second qubit and the third qubit, and a sixth BS gate applied to the fourth qubit and a fifth qubit. This pattern may continue until a BS gate is applied to a final qubit (e.g., qubit 8 in FIG. 2A). After that, layers of BS gates may be applied to the qubits in a reverse order. For example, the pyramid pattern includes a last layer that applies a BS gate to the first and second qubits, a second to last layer that applies a BS gate to the second and third qubits, etc.

In some embodiments, the computing system prepares a unary quantum state on the x qubits of the quantum computer (e.g., by executing quantum gates), the unary quantum state corresponding to input data (e.g., a vector) to be applied to the layer of the neural network (e.g., see Section 2.3 for more information). The unary quantum state may be a superposition of unary states corresponding to the input data vector. In some embodiments, an output quantum state formed on the x qubits by executing the at least O(log(n)) layers is also a unary quantum state, the output unary quantum state corresponding to output data of the layer of the neural network (e.g., see Section 3 for more information). In some embodiments, the computing system prepares the unary quantum state on the x qubits by: executing a first layer that applies an X gate to one of the x qubits; and after executing the first layer, executing n−1 layers that apply n−1 BS gates to the x qubits of the quantum computer (e.g., see FIG. 4A and related description for more information). At least one of: (1) the first layer or (2) at least one of the n−1 layers may be executed before one of the O(log(n)) layers is executed (e.g., as described with respect to FIG. 4A, an X gate and a gate of the data loader circuit 405 are executed before the first gate of the pyramid circuit 410). In some embodiments, one or more of the O(log(n)) layers are executed concurrently with one or more of the n−1 layers (e.g., with respect to FIG. 4A, some gates of the data loader circuit 405 are executed concurrently with some gates of the pyramid circuit 410). In some embodiments, the n−1 layers apply the n−1 BS gates according to a linear cascade pattern (e.g., see circuit 405 in FIG. 4A).
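The linear cascade loading described above can be illustrated with a classical simulation restricted to the unary amplitudes. This is a sketch under stated assumptions, not the loader of Section 2.3: it assumes each BS gate acts on adjacent amplitudes as the planar rotation [[cos θ, sin θ], [−sin θ, cos θ]], uses 0-based wire indices, and derives the angles with `arctan2` for that convention. The names `loader_angles` and `apply_cascade` are hypothetical.

```python
import numpy as np


def loader_angles(x):
    """Angles of n-1 cascaded rotations that map e_0 to x / ||x||."""
    x = np.asarray(x, dtype=float)
    x = x / np.linalg.norm(x)
    n = len(x)
    angles = []
    for i in range(n - 1):
        # Amplitude to pass on to wire i+1: the signed last entry at the
        # final step, otherwise the norm of everything still to be loaded.
        rest = x[i + 1] if i == n - 2 else np.linalg.norm(x[i + 1:])
        angles.append(np.arctan2(-rest, x[i]))
    return angles


def apply_cascade(angles):
    """Simulate the X gate followed by the cascade on unary amplitudes."""
    n = len(angles) + 1
    state = np.zeros(n)
    state[0] = 1.0  # X gate on the first qubit gives the unary state e_0
    for i, theta in enumerate(angles):
        c, s = np.cos(theta), np.sin(theta)
        a, b = state[i], state[i + 1]
        state[i], state[i + 1] = c * a + s * b, -s * a + c * b
    return state


if __name__ == "__main__":
    x = np.array([3.0, -1.0, 0.0, 2.0, -2.0])
    loaded = apply_cascade(loader_angles(x))
    print(np.allclose(loaded, x / np.linalg.norm(x)))
```

The sketch uses one X gate and n−1 two-wire rotations, matching the gate counts stated above; a different gate convention would change the signs of the angles but not the structure.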

12.2 Example Method of Training a Layer of a Neural Network.

FIG. 17 is a flowchart of an example method 1700 for training a layer of a neural network with an orthogonal weight matrix. The steps of method 1700 may be performed in different orders, and the method may include different, additional, or fewer steps. The method is described as being performed by a computing system that, for example, includes a classical computing system and quantum computing system. The computing system may perform the operations of the method by executing instructions stored on a non-transitory computer readable storage medium. Furthermore, a neural network manufactured by the steps of method 1700 may be stored on a non-transitory computer readable storage medium. See Section 4 for more information on training a quantum neural network.

The computing system executes 1710 layers of BS gates of a quantum circuit. Each BS gate is a single parameterized two-qubit gate. Weights of the weight matrix are based on values of parameters of the BS gates. In some embodiments, a quantum computing system executes the layers of the BS gates of the quantum circuit.

The computing system determines 1720 gradients of a cost function with respect to parameters of the BS gates of the quantum circuit.

The computing system updates 1730 values of parameters of the BS gates of the quantum circuit based on the gradients of the cost function. The updated values of the parameters preserve the orthogonality of the weight matrix.

In some embodiments, determining gradients of the cost function comprises determining gradients of the cost function with respect to the parameter of each BS gate of the quantum circuit.

In some embodiments, executing layers of BS gates of the quantum circuit includes the computing system measuring a resulting quantum state ζ^(λ) after each layer λ of the quantum circuit is executed. In some embodiments, the computing system determines errors δ for layers λ of the quantum circuit. In some embodiments, determining errors δ for layers λ of the quantum circuit comprises the computing system determining errors for each layer of the quantum circuit in reverse order according to: δ^(λ)=(w^(λ))^(T)·δ^(λ+1), where δ^(λ) is the error for layer λ of the quantum circuit and w^(λ) is a matrix representation of BS gates in layer λ of the quantum circuit. The gradient of the cost function C with respect to a parameter θ_(i) of a BS gate acting on qubits i and i+1 may be defined by:

$\frac{\partial C}{\partial\theta_{i}} = {{\delta_{i}^{\lambda + 1}\left( {{{- {\sin\left( \theta_{i} \right)}}\zeta_{i}^{\lambda}} + {{\cos\left( \theta_{i} \right)}\zeta_{i + 1}^{\lambda}}} \right)} + {{\delta_{i + 1}^{\lambda + 1}\left( {{{- {\cos\left( \theta_{i} \right)}}\zeta_{i}^{\lambda}} - {{\sin\left( \theta_{i} \right)}\zeta_{i + 1}^{\lambda}}} \right)}.}}$

In some embodiments, updating values of the parameters of the BS gates of the quantum circuit based on the gradients of the cost function includes the computing system: updating a value of a parameter θ_(i) of a BS gate of the quantum circuit according to

$\left. \theta_{i}\leftarrow{\theta_{i} - {\eta\frac{\partial C}{\partial\theta_{i}}}} \right.,$

where η is a learning rate.
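The error propagation, gradient, and update rules above can be sketched as a classical simulation. The following is illustrative only and makes several assumptions that are not stated in this section: wires are 0-based, each BS gate acts on adjacent amplitudes as [[cos θ, sin θ], [−sin θ, cos θ]], the cost is C = ½‖ζ^(L) − y‖², so the error at the output is ζ^(L) − y, and no nonlinearity is applied. All function names are hypothetical.

```python
import numpy as np


def layer_matrix(n, gates):
    """Matrix w of one layer; gates is a list of (i, theta) on wires i, i+1."""
    w = np.eye(n)
    for i, theta in gates:
        c, s = np.cos(theta), np.sin(theta)
        w[i, i], w[i, i + 1] = c, s
        w[i + 1, i], w[i + 1, i + 1] = -s, c
    return w


def forward(layers, x):
    """Return [zeta^0, zeta^1, ...], the state before and after each layer."""
    zetas = [np.asarray(x, dtype=float)]
    for gates in layers:
        zetas.append(layer_matrix(len(zetas[0]), gates) @ zetas[-1])
    return zetas


def gradients(layers, x, target):
    """Gradient of C = 0.5 * ||output - target||^2 for every gate angle."""
    zetas = forward(layers, x)
    delta = zetas[-1] - target  # error at the output of the last layer
    grads = [None] * len(layers)
    for lam in reversed(range(len(layers))):
        zeta = zetas[lam]  # state entering layer lam
        grads[lam] = [
            delta[i] * (-np.sin(t) * zeta[i] + np.cos(t) * zeta[i + 1])
            + delta[i + 1] * (-np.cos(t) * zeta[i] - np.sin(t) * zeta[i + 1])
            for i, t in layers[lam]
        ]
        # delta^lam = (w^lam)^T . delta^(lam+1)
        delta = layer_matrix(len(x), layers[lam]).T @ delta
    return grads


def sgd_step(layers, grads, eta):
    """theta_i <- theta_i - eta * dC/dtheta_i for every gate."""
    return [[(i, t - eta * g) for (i, t), g in zip(gates, layer_grads)]
            for gates, layer_grads in zip(layers, grads)]


if __name__ == "__main__":
    layers = [[(0, 0.3), (2, -0.7)], [(1, 1.1)]]
    x = np.array([0.5, 0.5, 0.5, 0.5])
    y = np.array([1.0, 0.0, 0.0, 0.0])
    layers = sgd_step(layers, gradients(layers, x, y), eta=0.1)
    W = layer_matrix(4, layers[1]) @ layer_matrix(4, layers[0])
    print(np.allclose(W @ W.T, np.eye(4)))  # weights stay orthogonal
```

Because only the angles are updated, the resulting weight matrix is a product of rotations after every step, which is why orthogonality is preserved without a separate orthogonalization step.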

13. Description of a Computing System

FIG. 20A is a block diagram that illustrates an embodiment of a computing system 2000. In the example of FIG. 20A, the computing system 2000 includes a classical computing system 2010 (also referred to as a non-quantum computing system) and a quantum computing system 2020; however, a computing system may just include a classical computing system or a quantum computing system. The classical computing system 2010 may control the quantum computing system 2020. An embodiment of the classical computing system 2010 is described further with respect to FIG. 21. While the classical computing system 2010 and quantum computing system 2020 are illustrated together, they may be physically separate systems (e.g., in a cloud architecture). In other embodiments, the computing system 2000 includes different or additional elements (e.g., multiple quantum computing systems 2020). In addition, the functions may be distributed among the elements in a different manner than described.

FIG. 20B is a block diagram that illustrates an embodiment of the quantum computing system 2020. The quantum computing system 2020 includes any number of quantum bits (“qubits”) 2050 and associated qubit controllers 2040. As illustrated in FIG. 20C, the qubits 2050 may be in a qubit register of the quantum computing system 2020. Qubits are further described below. A qubit controller 2040 is a module that controls one or more qubits 2050. A qubit controller 2040 may include a classical processor such as a CPU, GPU, or FPGA. A qubit controller 2040 may perform physical operations on one or more qubits 2050 (e.g., it can perform quantum gate operations on a qubit 2050). In the example of FIG. 20B, a separate qubit controller 2040 is illustrated for each qubit 2050; however, a qubit controller 2040 may control multiple (e.g., all) qubits 2050 of the quantum computing system 2020 or multiple controllers 2040 may control a single qubit. For example, the qubit controllers 2040 can be separate processors, parallel threads on the same processor, or some combination of both. In other embodiments, the quantum computing system 2020 includes different or additional elements. In addition, the functions may be distributed among the elements in a different manner than described.

FIG. 20D is a flow chart that illustrates an example execution of a quantum routine on the computing system 2000. The classical computing system 2010 generates 2060 a quantum program to be executed or processed by the quantum computing system 2020. The quantum program may include instructions or subroutines to be performed by the quantum computing system 2020. In an example, the quantum program is a quantum circuit. This program can be represented mathematically in a quantum programming language or intermediate representation such as QASM or Quil.

The quantum computing system 2020 executes 2065 the program and computes 2070 a result (referred to as a shot or run). Computing the result may include performing a measurement of a quantum state generated by the quantum computing system 2020 that resulted from executing the program. Practically, this may be performed by measuring values of one or more of the qubits 2050. The quantum computing system 2020 typically performs multiple shots to accumulate statistics from probabilistic execution. The number of shots and any changes that occur between shots (e.g., parameter changes) may be referred to as a schedule. The schedule may be specified by the program. The result (or accumulated results) is recorded 2075 by the classical computing system 2010. Results may be returned after a termination condition is met (e.g., a threshold number of shots occur).
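The accumulation of statistics over shots can be mimicked classically. The sketch below is a simulation only and does not call any quantum hardware or vendor API: it assumes a unary output state with known amplitudes, samples one measurement outcome per shot, and stops at a fixed number of shots as the termination condition. The function name `run_shots` is hypothetical.

```python
import numpy as np


def run_shots(amplitudes, num_shots, rng):
    """Sample `num_shots` measurement outcomes and return estimated probabilities."""
    probabilities = np.asarray(amplitudes, dtype=float) ** 2
    outcomes = rng.choice(len(probabilities), size=num_shots, p=probabilities)
    counts = np.bincount(outcomes, minlength=len(probabilities))
    return counts / num_shots


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    estimate = run_shots([0.5, 0.5, 0.5, 0.5], num_shots=1000, rng=rng)
    print(estimate)  # each entry approaches 0.25 as the number of shots grows
```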

FIG. 21 is an example architecture of a classical computing system 2010, according to an embodiment. The quantum computing system 2020 may also have one or more components described with respect to FIG. 21. FIG. 21 depicts a high-level block diagram illustrating physical components of a computer system used as part or all of one or more entities described herein, in accordance with an embodiment. A computer may have additional, fewer, or different components than those provided in FIG. 21. Although FIG. 21 depicts a computer 2100, the figure is intended as a functional description of the various features which may be present in computer systems rather than as a structural schematic of the implementations described herein. In practice, and as recognized by those of ordinary skill in the art, items shown separately could be combined and some items could be separated.

Illustrated in FIG. 21 are at least one processor 2102 coupled to a chipset 2104. Also coupled to the chipset 2104 are a memory 2106, a storage device 2108, a keyboard 2110, a graphics adapter 2112, a pointing device 2114, and a network adapter 2116. A display 2118 is coupled to the graphics adapter 2112. In one embodiment, the functionality of the chipset 2104 is provided by a memory controller hub 2120 and an I/O hub 2122. In another embodiment, the memory 2106 is coupled directly to the processor 2102 instead of the chipset 2104. In some embodiments, the computer 2100 includes one or more communication buses for interconnecting these components. The one or more communication buses optionally include circuitry (sometimes called a chipset) that interconnects and controls communications between system components.

The storage device 2108 is any non-transitory computer-readable storage medium, such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or other optical storage, a solid-state memory device, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, flash memory devices, or other non-volatile solid-state storage devices. Such a storage device 2108 can also be referred to as persistent memory. The pointing device 2114 may be a mouse, track ball, or other type of pointing device, and is used in combination with the keyboard 2110 to input data into the computer 2100. The graphics adapter 2112 displays images and other information on the display 2118. The network adapter 2116 couples the computer 2100 to a local or wide area network.

The memory 2106 holds instructions and data used by the processor 2102. The memory 2106 can be non-persistent memory, examples of which include high-speed random access memory, such as DRAM, SRAM, DDR RAM, ROM, EEPROM, or flash memory.

As is known in the art, a computer 2100 can have different or other components than those shown in FIG. 21. In addition, the computer 2100 can lack certain illustrated components. In one embodiment, a computer 2100 acting as a server may lack a keyboard 2110, pointing device 2114, graphics adapter 2112, or display 2118. Moreover, the storage device 2108 can be local or remote from the computer 2100 (such as embodied within a storage area network (SAN)).

As is known in the art, the computer 2100 is adapted to execute computer program modules for providing functionality described herein. As used herein, the term “module” refers to computer program logic utilized to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, or software. In one embodiment, program modules are stored on the storage device 2108, loaded into the memory 2106, and executed by the processor 2102.

Referring back to FIGS. 20A-20C, the quantum computing system 2020 exploits the laws of quantum mechanics in order to perform computations. A quantum processing device, quantum computer, quantum processor, and quantum processing unit are each examples of a quantum computing system. A quantum computing system can be a universal or a non-universal quantum processing device (a universal quantum device can execute any possible quantum circuit, subject to the constraint that the circuit doesn't use more qubits than the quantum device possesses). Quantum processing devices commonly use so-called qubits, or quantum bits. While a classical bit always has a value of either 0 or 1, a qubit is a quantum mechanical system that can have a value of 0, 1, or a superposition of both values. Example physical implementations of qubits include superconducting qubits, spin qubits, trapped ions, arrays of neutral atoms, and photonic systems (e.g., photons in waveguides). For the purposes of this disclosure, a qubit may be realized by a single physical qubit or as an error-protected logical qubit that itself comprises multiple physical qubits. The disclosure is also not specific to qubits. The disclosure may be generalized to apply to quantum computing systems whose building blocks are qudits (d-level quantum systems, where d>2) or quantum continuous variables, rather than qubits.

A quantum circuit is an ordered collection of one or more gates. A sub-circuit may refer to a circuit that is a part of a larger circuit. A gate represents a unitary operation performed on one or more qubits. Quantum gates may be described using unitary matrices. The depth of a quantum circuit is the least number of steps needed to execute the circuit on a quantum computing system. The depth of a quantum circuit may be smaller than the total number of gates because gates acting on non-overlapping subsets of qubits may be executed in parallel. A layer of a quantum circuit may refer to a step of the circuit, during which multiple gates may be executed in parallel. In some embodiments, a quantum circuit is executed by a quantum computing system. In this sense a quantum circuit can be thought of as comprising a set of instructions or operations that a quantum computing system can execute. To execute a quantum circuit on a quantum computing system, a user may inform the quantum computing system what circuit is to be executed. A quantum computing system may include both a core quantum device and a classical peripheral/control device (e.g., a qubit controller) that is used to orchestrate the control of the quantum device. It is to this classical control device that the description of a quantum circuit may be sent when one seeks to have a quantum computer execute a circuit.
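The relationship between gate count and depth can be illustrated with a short scheduling sketch. It assumes an ordered gate list in which each gate is a tuple of the qubits it acts on, and places every gate in the earliest step after all earlier gates on the same qubits; the function name `circuit_depth` is hypothetical.

```python
def circuit_depth(gates):
    """Depth of an ordered gate list using as-soon-as-possible scheduling.

    `gates` is an ordered list of tuples of qubit labels. Gates on disjoint
    qubits may share a step; gates sharing a qubit keep their order.
    """
    last_step = {}  # qubit label -> step of the latest gate on that qubit
    depth = 0
    for qubits in gates:
        step = 1 + max(last_step.get(q, 0) for q in qubits)
        for q in qubits:
            last_step[q] = step
        depth = max(depth, step)
    return depth


if __name__ == "__main__":
    gates = [(1, 2), (3, 4), (2, 3)]
    # Three gates, but (1, 2) and (3, 4) run in the same step.
    print(len(gates), circuit_depth(gates))
```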

A variational quantum circuit may refer to a parameterized quantum circuit that is executed many times, where each time some of the parameter values may be varied. The parameters of a parameterized quantum circuit may refer to parameters of the gate unitary matrices. For example, a gate that performs a rotation about the y axis may be parameterized by a real number that describes the angle of the rotation. Variational quantum algorithms are a class of hybrid quantum-classical algorithm in which a classical computer is used to choose and vary the parameters of a variational quantum circuit. Typically, the classical processor updates the variational parameters based on the outcomes of measurements of previous executions of the parameterized circuit.

The description of a quantum circuit to be executed on one or more quantum computers may be stored in a non-transitory computer-readable storage medium. The term “computer-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions. The term “computer-readable medium” shall also be taken to include any medium that is capable of storing instructions for execution by the quantum computing system and that cause the quantum computing system to perform any one or more of the methodologies disclosed herein. The term “computer-readable medium” includes, but is not limited to, data repositories in the form of solid-state memories, optical media, and magnetic media.

The approaches described above may be amenable to a cloud quantum computing system, where quantum computing is provided as a shared service to separate users. One example is described in patent application Ser. No. 15/446,973, “Quantum Computing as a Service,” which is incorporated herein by reference.

14. Additional Considerations

The disclosure above describes example embodiments for purposes of illustration only. Any features that are described as essential, important, or otherwise implied to be required should be interpreted as only being required for that embodiment and are not necessarily included in other embodiments.

Additionally, the above disclosure often uses the phrase “we” (and other similar phrases) to reference an entity that is performing an operation (e.g., a step in an algorithm). These phrases are used for convenience. These phrases may refer to a computing system (e.g., including a classical computing system and a quantum computing system) that is performing the described operations.

Some portions of the above description describe the embodiments in terms of algorithmic processes or operations. These algorithmic descriptions and representations are commonly used by those skilled in the computing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs comprising instructions for execution by a processor or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times to refer to these arrangements of functional operations as modules, without loss of generality. In some cases, a module can be implemented in hardware, firmware, or software.

As used herein, any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment. Similarly, use of “a” or “an” preceding an element or component is done merely for convenience. This description should be understood to mean that one or more of the elements or components are present unless it is obvious that it is meant otherwise. As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).

In addition, “a” or “an” is employed to describe elements and components of the embodiments. This is done merely for convenience and to give a general sense of the disclosure. This description should be read to include one or at least one and the singular also includes the plural unless it is obvious that it is meant otherwise. Where values are described as “approximate” or “substantially” (or their derivatives), such values should be construed as accurate +/−10% unless another meaning is apparent from the context. For example, “approximately ten” should be understood to mean “in a range from nine to eleven.”

Alternative embodiments are implemented in computer hardware, firmware, software, and/or combinations thereof. Implementations can be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a programmable processor; and method steps can be performed by a programmable processor executing a program of instructions to perform functions by operating on input data and generating output. As used herein, ‘processor’ may refer to one or more processors. Embodiments can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. Each computer program can be implemented in a high-level procedural or object-oriented programming language, or in assembly or machine language if desired; and in any case, the language can be a compiled or interpreted language. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, a processor will receive instructions and data from a read-only memory and/or a random-access memory. Generally, a computer will include one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM disks. Any of the foregoing can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits) and other forms of hardware.

Although the above description contains many specifics, these should not be construed as limiting the scope of the invention but merely as illustrating different examples. It should be appreciated that the scope of the disclosure includes other embodiments not discussed in detail above. Various other modifications, changes, and variations which will be apparent to those skilled in the art may be made in the arrangement, operation, and details of the methods and apparatuses disclosed herein without departing from the spirit and scope of the invention.

What is claimed is:
 1. A method for training a layer of a neural networkwith an orthogonal weight matrix, the method comprising: executinglayers of BS gates of a quantum circuit, each BS gate being a singleparameterized two-qubit gate, weights of the weight matrix being basedon values of parameters of the BS gates; determining gradients of a costfunction with respect to parameters of the BS gates of the quantumcircuit; updating values of parameters of the BS gates of the quantumcircuit based on the gradients of the cost function, updated values ofthe parameters preserving the orthogonality of the weight matrix.
 2. Themethod of claim 1, wherein determining gradients of the cost functioncomprises determining gradients of the cost function with respect to theparameter of each BS gate of the quantum circuit.
 3. The method of claim1, wherein executing layers of BS gates of the quantum circuitcomprises: measuring a resulting quantum state ζ^(λ) after each layer λof the quantum circuit is executed.
 4. The method of claim 3, furthercomprising determining errors δ for layers λ of the quantum circuit. 5.The method of claim 4, wherein determining errors δ for layers λ of thequantum circuit comprises determining errors for each layer of thequantum circuit in reverse order according to:δ^(λ)=(w ^(λ))^(T)·δ^(λ+1), where δ^(λ) is the error for layer λ of thequantum circuit and w^(λ) is a matrix representation of BS gates inlayer λ of the quantum circuit.
 6. The method of claim 5, wherein the gradient of the cost function C with respect to a parameter θ_(i) of a BS gate acting on qubits i and i+1 is defined by: $\frac{\partial C}{\partial \theta_{i}} = \delta_{i}^{\lambda + 1}\left( -\sin(\theta_{i})\,\zeta_{i}^{\lambda} + \cos(\theta_{i})\,\zeta_{i + 1}^{\lambda} \right) + \delta_{i + 1}^{\lambda + 1}\left( -\cos(\theta_{i})\,\zeta_{i}^{\lambda} - \sin(\theta_{i})\,\zeta_{i + 1}^{\lambda} \right).$
 7. The method of claim 6, wherein updating values of the parameters of the BS gates of the quantum circuit based on the gradients of the cost function comprises: updating a value of a parameter θ_(i) of a BS gate of the quantum circuit according to $\theta_{i} \leftarrow \theta_{i} - \eta \frac{\partial C}{\partial \theta_{i}}$, where η is a learning rate.
 8. The method of claim 1, wherein a quantum computing system executes the layers of the BS gates of the quantum circuit.
 9. A non-transitory computer-readable storage medium comprising stored instructions that, when executed by a computing system, cause the computing system to perform operations including: executing layers of BS gates of a quantum circuit, each BS gate being a single parameterized two-qubit gate, weights of the weight matrix being based on values of parameters of the BS gates; determining gradients of a cost function with respect to parameters of the BS gates of the quantum circuit; updating values of parameters of the BS gates of the quantum circuit based on the gradients of the cost function, updated values of the parameters preserving the orthogonality of the weight matrix.
 10. The non-transitory computer-readable storage medium of claim 9, wherein determining gradients of the cost function comprises determining gradients of the cost function with respect to the parameter of each BS gate of the quantum circuit.
 11. The non-transitory computer-readable storage medium of claim 9, wherein executing layers of BS gates of the quantum circuit comprises: measuring a resulting quantum state ζ^(λ) after each layer λ of the quantum circuit is executed.
 12. The non-transitory computer-readable storage medium of claim 11, further comprising determining errors δ for layers λ of the quantum circuit.
 13. The non-transitory computer-readable storage medium of claim 12, wherein determining errors δ for layers λ of the quantum circuit comprises determining errors for each layer of the quantum circuit in reverse order according to: $\delta^{\lambda} = \left( w^{\lambda} \right)^{T} \cdot \delta^{\lambda + 1}$, where δ^(λ) is the error for layer λ of the quantum circuit and w^(λ) is a matrix representation of BS gates in layer λ of the quantum circuit.
 14. The non-transitory computer-readable storage medium of claim 13, wherein the gradient of the cost function C with respect to a parameter θ_(i) of a BS gate acting on qubits i and i+1 is defined by: $\frac{\partial C}{\partial \theta_{i}} = \delta_{i}^{\lambda + 1}\left( -\sin(\theta_{i})\,\zeta_{i}^{\lambda} + \cos(\theta_{i})\,\zeta_{i + 1}^{\lambda} \right) + \delta_{i + 1}^{\lambda + 1}\left( -\cos(\theta_{i})\,\zeta_{i}^{\lambda} - \sin(\theta_{i})\,\zeta_{i + 1}^{\lambda} \right).$
 15. The non-transitory computer-readable storage medium of claim 14, wherein updating values of the parameters of the BS gates of the quantum circuit based on the gradients of the cost function comprises: updating a value of a parameter θ_(i) of a BS gate of the quantum circuit according to $\theta_{i} \leftarrow \theta_{i} - \eta \frac{\partial C}{\partial \theta_{i}}$, where η is a learning rate.
 16. The non-transitory computer-readable storage medium of claim 9, wherein a quantum computing system executes the layers of the BS gates of the quantum circuit.
 17. A neural network stored on a non-transitory computer readable storage medium, wherein the neural network is manufactured by a process comprising: executing layers of BS gates of a quantum circuit, each BS gate being a single parameterized two-qubit gate, weights of the weight matrix being based on values of parameters of the BS gates; determining gradients of a cost function with respect to parameters of the BS gates of the quantum circuit; updating values of parameters of the BS gates of the quantum circuit based on the gradients of the cost function, updated values of the parameters preserving the orthogonality of the weight matrix.
 18. The neural network of claim 17, wherein determining gradients of the cost function comprises determining gradients of the cost function with respect to the parameter of each BS gate of the quantum circuit.
 19. The neural network of claim 17, wherein executing layers of BS gates of the quantum circuit comprises: measuring a resulting quantum state ζ^(λ) after each layer λ of the quantum circuit is executed.
 20. The neural network of claim 20 placeholder
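By way of illustration only, the error propagation, gradient, and parameter update recited above may be sketched as a classical simulation in Python with NumPy. The sketch is not part of the claims and rests on several assumptions not taken from the disclosure: each BS gate is modeled as the planar rotation [[cos θ, sin θ], [−sin θ, cos θ]] acting on amplitudes i and i+1 (chosen so that its derivative agrees with the gradient expression above), the cost function is a squared error, and the gate layout, the input, the target, and all function names are hypothetical.

```python
import numpy as np

def apply_layer(zeta, gates):
    """Apply one layer of BS gates to zeta. `gates` is a list of (i, theta)
    pairs acting on disjoint index pairs (i, i+1)."""
    out = zeta.copy()
    for i, theta in gates:
        c, s = np.cos(theta), np.sin(theta)
        out[i] = c * zeta[i] + s * zeta[i + 1]
        out[i + 1] = -s * zeta[i] + c * zeta[i + 1]
    return out

def backpropagate_layer(delta_next, gates):
    """Compute delta^(lambda) = (w^(lambda))^T . delta^(lambda+1)."""
    out = delta_next.copy()
    for i, theta in gates:
        c, s = np.cos(theta), np.sin(theta)
        out[i] = c * delta_next[i] - s * delta_next[i + 1]
        out[i + 1] = s * delta_next[i] + c * delta_next[i + 1]
    return out

def forward(x, layers):
    """Return the states zeta^(0), ..., zeta^(L) seen before/after each layer."""
    zetas = [x]
    for gates in layers:
        zetas.append(apply_layer(zetas[-1], gates))
    return zetas

def gradients(x, target, layers):
    """Gradient of the assumed cost C = 0.5 * ||output - target||^2
    with respect to every gate angle, visiting layers in reverse order."""
    zetas = forward(x, layers)
    delta = zetas[-1] - target  # error at the output for the assumed cost
    grads = [None] * len(layers)
    for lam in reversed(range(len(layers))):
        zeta = zetas[lam]
        layer_grads = []
        for i, theta in layers[lam]:
            c, s = np.cos(theta), np.sin(theta)
            layer_grads.append(
                delta[i] * (-s * zeta[i] + c * zeta[i + 1])
                + delta[i + 1] * (-c * zeta[i] - s * zeta[i + 1])
            )
        grads[lam] = layer_grads
        delta = backpropagate_layer(delta, layers[lam])
    return grads

def train_step(x, target, layers, eta=0.1):
    """One gradient-descent update theta <- theta - eta * dC/dtheta."""
    grads = gradients(x, target, layers)
    return [
        [(i, theta - eta * g) for (i, theta), g in zip(gates, layer_grads)]
        for gates, layer_grads in zip(layers, grads)
    ]

def weight_matrix(layers, n):
    """Assemble the n x n matrix implemented by the layers, column by column."""
    W = np.eye(n)
    for gates in layers:
        W = np.stack([apply_layer(W[:, k], gates) for k in range(n)], axis=1)
    return W

# Hypothetical 4-dimensional example with three layers of gates.
layers = [[(0, 0.3), (2, -0.5)], [(1, 0.7)], [(0, 1.1), (2, 0.2)]]
x = np.array([0.5, -0.5, 0.5, 0.5])
target = np.array([1.0, 0.0, 0.0, 0.0])
updated = train_step(x, target, layers)
W = weight_matrix(updated, 4)
print(np.allclose(W.T @ W, np.eye(4)))
```

Because only the angles are updated and the matrix is always a product of rotations, the assembled matrix stays orthogonal after the update without any separate orthogonalization step.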